Journal of Machine Learning Research XX (2013) 1-46
Submitted 2/13; Published XX/XX
Reasoning about Independence
in Probabilistic Models of Relational Data
Marc Maier maier@cs.umass.edu
Katerina Marazopoulou kmarazo@cs.umass.edu
David Jensen jensen@cs.umass.edu
School of Computer Science
University of Massachusetts Amherst
Amherst, MA 01003, USA
Editor:
Abstract
Bayesian networks leverage conditional independence to compactly encode joint probability
distributions. Many learning algorithms exploit the constraints implied by observed condi-
tional independencies to learn the structure of Bayesian networks. The rules of rf-separation
provide a theoretical and algorithmic framework for deriving conditional independence facts
from model structure. However, this theory only applies to Bayesian networks. Many real-
world systems, such as social or economic systems, are characterized by interacting hetero-
geneous entities and probabilistic dependencies that cross the boundaries of entities. Con-
sequently, researchers have developed extensions to Bayesian networks that can represent
these relational dependencies. We show that the theory of rf-separation inaccurately infers
conditional independence when applied directly to the structure of probabilistic models
of relational data. We introduce relational d-separation, a theory for deriving conditional
independence facts from relational models. We provide a new representation, the abstract
ground graph, that enables a sound, complete, and computationally efficient method for
answering rf-separation queries about relational models, and we present empirical results
that demonstrate effectiveness.
Keywords: relational models, d-separation, conditional independence, lifted representa-
tions, directed graphical models
1. Introduction
Bayesian networks are a widely used class of graphical models that are capable of compactly
representing a joint probability distribution over a set of variables. The joint distribution
can be factorized into a product of much smaller conditional distributions by assuming
that variables are independent of their non-descendants given their parents (the Markov
condition). The Markov condition ties the structure of the model to the set of conditional
independencies that hold over all probability distributions the model can represent. Accu-
rate reasoning about such conditional independence facts is the basis for constraint-based
algorithms, such as PC, FCI, and MMHC, that are commonly used to learn the structure of
Bayesian networks (Spirtes et al. , 2000 Tsamardinos et al. , 2006). Under a small number
©2013 Marc Maier and Katerina Marazopoulou and David Jensen.
Maier, Marazopoulou, and Jensen
of assumptions and with knowledge of the conditional independencies, these algorithms can
identify causal structure ( Pearl , 2000 Spirtes et al. , 2000 ) .
Deriving the full set of conditional independencies implied by the Markov condition is
complex, requiring manipulation of the joint distribution and various probability axioms.
Fortunately, the exact same set of conditional independencies entailed by the Markov con-
dition are also entailed by d-separation, a set of graphical rules that algorithmically derive
conditional independence facts directly from the model structure. That is, the Markov
condition and d-separation are equivalent approaches for producing conditional indepen-
dence from Bayesian networks ( Neapolitan! 2004). When interpreting a Bayesian network
causally, the causal Markov condition (variables are independent of their non-effects given
their direct causes) and d-separation have been shown to provide the correct connection
between causal structure and conditional independence (Scheines, 1997).
Bayesian networks assume that data instances are independent and identically dis-
tributed, but many real- world systems are characterized by interacting heterogeneous enti-
ties. For example, social network data consist of individuals, groups, and their relationships;
citation data involve researchers collaborating and authoring scholarly papers that cite prior
work; and sports data comprise players, coaches, teams, referees, and their competitive
interactions. Over the past 15 years, researchers in statistics and computer science have
devised more expressive classes of directed graphical models, such as probabilistic relational
models (PRMs), which remove the assumptions of independent and identically distributed
instances to more accurately describe these types of domains (Getoor and Taskar, 2007).
Relational models generalize models that incorporate interference, spillover effects, or viola-
tions of the stable unit treatment value assumption (SUTVA) (Hudgens and Halloran, 2008
Tchetgen and VanderWeele 2012) and multilevel or hierarchical models (Gelman and Hill
2007). Many practical applications have also benefited from learning and reasoning with
relational models. Examples include analysis of gene regulatory interactions (Segal et al.
2001'), scholarly citations (Taskar et al. , 2001), and biological cellular networks (Friedman
2004j).
Relational models also assume the Markov condition for connecting the underlying joint
distribution with the structure of the model. Just like Bayesian networks, knowledge of
conditional independencies that hold over relational data is important for relational models.
Relational data typically encompass a much larger space of variables; thus, the ability to
factorize the joint distribution into a set of smaller conditional distributions is crucial for
representation and inference. Furthermore, constraint-based algorithms for learning the
structure of relational models, such as Relational PC (Maier et al. 2010), are more efficient
than searching the entire model space, but they rely on conditional independence facts to
discover model structure.
In this paper, we show that d-separation does not correctly produce conditional inde-
pendence facts when applied directly to relational models. We introduce an alternative
representation, the abstract ground graph, that enables an algorithm for deriving condi-
tional independence facts from relational models. We show that this algorithm is sound,
complete, and computationally efficient, and we provide an empirical demonstration of the
effectiveness of our approach across synthetic causal structures of relational domains.
In Section [2| we present an example that highlights why standard d-separation is in-
adequate for relational models. Section [s] offers background on propositional (Bayesian
2
Independence in Models of Relational Data
network) models, and Section |4] formalizes the fundamental concepts of relational data and
models. In Section [5| we define relational d-separation and abstract ground graphs, and we
present our soundness and completeness results. In Section [6j we evaluate how frequently
d-separation on relational models is equivalent to d-separation on abstract ground graphs,
and Section [7] contains additional experimental results. Finally, Section [8] discusses the
implications that relational d-separation has for representation and learning and provides
future directions for this work.
1.1 Why d-separation is Useful
A Bayesian network, as a model of a joint probability distribution, enables a wide array of
useful tasks by supporting inference over an entire system of variables. They have been suc-
cessfully applied to many domains, ranging from bioinformatics and medicine to computer
vision and information retrieval. Naively specifying a joint distribution by hand requires
an exponential number of states; however, Bayesian networks represent joint probability
distributions that can be factorized into a product of conditional probability distributions.
Bayesian networks leverage the Markov condition to represent conditional independencies
in order to compactly specify a joint probability distribution.
Alternative to the Markov condition, but equivalent in its implications, d-separation
provides an algorithmic framework to derive the conditional independencies encoded by
the model. These conditional independence facts are guaranteed to hold in every faithful
joint distribution the model represents and, consequently, any data instance sampled from
those distributions. The semantics of holding across all distributions is the main reason
why d-separation is a useful theory.
Causal discovery, the task of learning generative models of observational data, superfi-
cially appears to be a futile endeavor. Yet learning and reasoning about the causal structure
underlying real domains is a primary goal for many researchers. Fortunately, d-separation
offers a connection between causal structure and conditional independence. The theory of
d-separation can be leveraged to constrain the hypothesis space by eliminating models that
are inconsistent with observed conditional independence facts. While many distributions
do not lead to uniquely identifiable models, this approach (under simple assumptions) fre-
quently discovers useful causal knowledge for domains that can be represented as a Bayesian
network.
As described above, relational models more closely represent the real- world domains
that many social scientists and other researchers investigate. These models can represent
systems of interacting heterogeneous entities and probabilistic dependencies that cross entity
boundaries. In order to successfully learn causal models of observational relational data, we
need a similar theory for deriving conditional independence from relational models. In this
paper, we formalize the theory of relational d-separation, which will provide a theoretical
framework for algorithms that learn causal models of relational domains.
2. Example
Consider a corporate analyst who was hired to identify which products and employees are
effective and productive for some organization. If the company is structured as a pure
project-based organization, the analyst may collect data as described by the relational
3
Maier, Marazopoulou, and Jensen
EMPLOYEE
C!! Z Su ccess
PRODUCT
<^ Budget
Revenue
BUSINESS-UNIT
(a) Example relational schema for an organization consisting of employees working on products, which are
funded by specific business units within a corporation.
^ Salary^
C 3^omp e tence^^
Quinn
C;;^ Salary ^
DEVELOPS
/
^jjCompetenceJ^
Paul
^Success^
Case
^^uccess^
Adapter
^^ Com pet ence] ^
Roger
r
Budget
Devices
^l^ompetence]^
Thomas
C^^I^SLJCcess^^
^^Success^
Laptop
Tablet
(b) Example fragment of a relational skeleton. Roger and Sally are employees, both of whom develop the
Laptop product, but only Sally works on product Tablet. Both products Laptop and Tablet are funded by
business unit Devices.
Figure 1: An example relational schema and skeleton for the organization domain.
schema in Figure 1(a), The schema denotes that employees can collaborate and work on
multiple products, each of which is funded by a specific business unit. The analyst has also
obtained variables on each entity — salary and competence of employees, the success of each
product, and the budget and revenue of business units. In this example, the organization
consists of five employees, five products, and two business units, which are shown in the
relational skeleton in Figure 1(b)
The analyst may believe that the organization operates under the model depicted in
Figure 2(a) For example, the competence of an employee affects the success of products
they develop, and the revenue of a business unit is infiuenced by the success of products
that it funds. The analyst then needs to verify the model structure in order to accurately
advise executive decisions, such as determining which business units should have increased
funding, which employees are being fairly or unfairly compensated, etc. Perhaps the analyst
has experience in graphical models and decides to check that the conditional independencies
encoded by the model are reflected in the data, assuming the faithfulness condition (see
Section [s]). The analyst then naively applies d-separation to the model structure in an
attempt to derive these conditional independencies. However, applying (i-separation directly
to relational models does not correctly derive the set of conditional independencies. In other
words, d-separation applied directly to relational models is not equivalent to the Markov
condition.
4
Independence in Models of Relational Data
[EMPLOYEE, DEVELOPS, PRODUCT, FUNDS, BUSINESS-UNITl.Budget [EMPLOYEEJ.Salary
[EMPLOYEE].Competence (EMPLOYEE].Salary
[PRODUCT, DEVELOPS, EMPLOYEEj.Competence [PRODUCTj.Success
[BUSINESS-UNIT FUNDS, PRODUCTj.Success [BUSINESS-UNITJ.Revenue
[BUSINESS-UNIT].Revenue [BUSINESS-UNITj.Budget
(a) Example relational model. Competence of employees cause the success of products they develop, which
in turn influences the revenue received by the business unit funding the product. Additional dependencies
involve the budget of business units and employee salaries. The dependencies are specified by relational
paths, listed below the graphical model.
(b) Example fragment of a ground graph. The success of product Laptop is influenced by the competence
of both Roger and Sally. The revenue of business unit Devices is caused by the success of all its funded
products — Laptop, Tablet, and Smartphone.
Figure 2: An example relational model and ground graph for the organization domain.
Naively applying d-separation to the model in Figure 2(a)| suggests that employee com-
petence is conditionally independent of the revenue of business units given the success of
products. To see why this approach is flawed, we must consider the ground graph. A nec-
essary precondition for inference is to apply a model to a data instantiation, which yields
a ground graph to which d-separation can be applied. For a Bayesian network, a ground
graph consists of replicates of the model structure for each data instance. In contrast, a
relational model defines a template for how dependencies apply to a data instantiation, re-
sulting in a ground graph with varying structure (see Section [1] for more details on ground
graphs) .
Figure 2(b) [ shows the ground graph for the relational model in Figure 2(a) applied to the
relational skeleton in Figure 1(b) This ground graph illustrates that simply conditioning on
product success can activate a path through the competence of other employees who develop
the same products — we call this a relational d-connecting path^ Checking d-separation on
1. The indirect effect attributed to a relational rf-connecting path is often referred to as interference, a
spillover effect, or a violation of the stable unit treatment value assumption (SUTVA) because the
treatment of one instance (employee competence) affects the outcome of another (the revenue of another
employee's business unit).
5
Maier, Marazopoulou, and Jensen
the ground graph indicates that to (i-separate employee competence from business unit
revenue, we must not only condition on the success of developed products, but also on the
competence of other employees who work on those products (e.g., Roger. Competence _LL
Devices. -Refenue | {Laptop. Success, Sally.Competence}).
This example also demonstrates that the Markov condition can be violated when di-
rectly applied to the structure of a relational model. In this case, the Markov condition
applied to the model in Figure 2(a) implies that P {Competence, Revenue \ Success) =
P{Competence \ Success) P{Revenue \ Success), that revenue is independent of its non-
descendants (competence) given its parents (success). However, the ground graph shows
the opposite, for example, P (Rogei. Competence, Devices. Revenue \ Laptop. Success) /
P (Roger. Competence [Laptop. Success) P(Devices. -Reuenue | Laptop. Success). This is not
surprising since the conditional independencies entailed by (i-separation are known to be
equivalent to those entailed by the Markov condition (Neapolitan, 2004). In fact, this is not
the only conditional independence fact for which d-separation produces an incorrect result
for this example model. Over 80% of the pairs of relational variables cannot be described
by direct inspection of the model, and of those that can (such as the above example), 91%
yield an incorrect conclusion. This is a single data point of a larger empirical evaluation
presented in Section [6j
It might appear that, since the standard rules of d-separation apply to Bayesian net-
works and the ground graphs of relational models are also Bayesian networks, that ap-
plying d-separation to relational models is a non-issue. However, applying d-separation
to a single ground graph may result in potentially unbounded runtime if the instantiation
is large — especially given that relational databases can be arbitrarily large. Further, and
more importantly, the semantics of d-separation require that conditional independencies
hold across all possible model instantiations. Although d-separation can apply directly to
a ground graph, these semantics prohibits reasoning about a single ground graph.
The conditional independence facts derived from (/-separation hold for all faithful dis-
tributions represented a Bayesian network. Therefore, the implications of relational d-
separation should analogously hold for all faithful distributions of variables for the space
of all possible ground graphs. It is simple to show that ^-separation holds for any ground
graph of a Bayesian network — every ground graph consists of a set of disconnected sub-
graphs, each of which has a structure that is identical to that of the model. However,
relational models are templates, and the resulting ground graphs vary with the relational
structure of the underlying data (e.g., different products are developed by varying numbers
of employees). As a result, relational ^-separation queries must be answered without re-
spect to ground graphs. Additionally, the example illustrates how relational dependencies
can exhibit d-connecting paths that are only manifest in ground graphs, not the model
representation. In Section [5j we describe a new representation, the abstract ground graph,
that can be used to reason about d-separation for relational models.
3. Background on Propositional Data and Models
A common assumption in classical statistics, machine learning, and causal discovery is that
data instances are independent and identically distributed (IID). The first condition assumes
that the variables on any given data instance are marginally independent of the variables
6
Independence in Models of Relational Data
of any other data instances. The second condition assumes that every data instance is
drawn from the same underlying joint probability distribution. IID data (also referred to
as propositional datgj^ are effectively represented as a single table, where rows correspond
to independent instances and columns are attributes of those instances.
The directed acyclic graph (DAG), or Bayesian network, is a widely used probabilistic
graphical model of propositional data (Pearl, 1988). A DAG, G = (V, E), consists of a
set of vertices V corresponding to random variables in the data and a set of edges E C
V X V encoding the conditional independencies expected to hold among the variables. Each
variable y G V is described by a conditional probability distribution P{V \ parents{V)),
where parents{V) C V \ {V} is the set of parent variables for V.
If the joint probability distribution -P(V) satisfies the Markov condition for G, then
P(V) can be compactly factorized according to the conditional distributions, P(V) =
Y\ve'V I po^ents(y)). The Markov condition states that every variable V G V is
conditionally independent of its non-descendants given its parents, where the descendants
of V are all variables reachable by a directed path from V. Deriving the set of conditional
independencies from G based on the Markov condition is cumbersome, requiring complex
combinations of probability axioms. Fortunately, d-separation, a set of graphical criteria,
provides the foundation for algorithmic derivation of all conditional independencies in G
Markov condition (Neapolitan, 2004)
(Geiger et al. , 1990) and entails the exact same set of conditional independencies as the
In the following definition, a path is a sequence of vertices following edges in either
direction. We say that a variable y is a collider on a path p if the two arrowheads point at
each other (collide) at V; otherwise, y is a non-collider on p.
Definition 3.1 (d-separation) Let X, Y, and Z be disjoint sets of variables in directed
acyclic graph G. A path from some X G X to some 1" G Y is d-connected by Z if and only
if every collider W on the path, or a descendant of W, is a member of Z, and there are no
non-colliders in Z. Then, say that X and Y are d-separated by Z if and only if there are
no (i-connecting paths between X and Y given Z.
Figure 3(a) depicts the graphical patterns found along paths that lead to d-separation
or d-connection based on Definition 3.1, and Figure [3(b) [ provides example d-separated and
c?-connected paths. At a first glance, identifying conditional independence facts from the
rules of d-separation appears computationally intensive, testing a potentially exponential
number of paths. However, Geiger et al. (1990) provide a linear-time algorithm based on
breadth-first search and reachability on G.
Under a few assumptions, Bayesian networks can be interpreted causally, with edges
corresponding to direct causal dependencies. If X —t- y is an edge in the causal model
G, then manipulating or changing the value of X will alter the conditional distribution of
Y — P{Y I do{X)) using Pearl's do-calculus notation for interventions (Pearl, 2000). The
causal interpretation of G assumes the causal Markov condition, which is identical to the
Markov condition, replacing parents with direct causes and non-descendants with non-
effects. In order for the causal Markov condition to hold, the variables V must also be
2. IID data are typically referred to as propositional because the data can be equivalently expressed under
propositional logic.
7
Maier, Marazopoulou, and Jensen
(/-separating path elements
(exists one on path)
Q I ^ I
(/-connecting path elements
(exists all on path)
(xW>-®
®
(a) Graphical patterns of d-separating and ^-connecting path elements among disjoint sets of variables X
and Y given Z. Paths for which there exists a non-collider in Z or a collider not in Z are d-separating.
Paths for which all non-colliders are not in Z and all colliders (or a descendant of colliders) are in Z are
rf-connecting.
(/-separated paths (/-connected paths
(b) Several example d-separated and d-connected paths that illustrates the composition of path elements.
Figure 3: Patterns of (^-separating and d-connecting path elements and example d-
separating and (/-connecting paths.
causally sufficient — there can be no latent common causes for any pair of variables in V.
The causal Markov condition is also equivalent to (/-separation; therefore, both provide the
connection between causal structures and probability distributions.
The conditional independencies entailed by both the causal Markov condition and d-
separation hold over all faithful distributions that G can represent. A distribution P is
faithful to G if and only if all conditional independencies in P that are entailed by the
causal Markov condition on G. Additionally, if P is assumed to be faithful to G, then there
are algorithms that can learn the Markov, or likelihood, equivalent set of causal models.
That is, these algorithms correctly identify the causal dependencies in G that are consistent
with observed conditional independence facts across V. These algorithms typically exploit
8
Independence in Models of Relational Data
the rules of d-separation and the assumption of model acyclicity to determine the direction
of causality (Spirtes et al. , 2000)
4. Concepts of Relational Data and Models
Propositional representations describe domains with a single entity type, but many real-
world systems involve multiple types of interacting entities with probabilistic dependencies
that cross the boundaries of entities. The model in Figure 2(a) consists of three such de-
pendencies (e.g., the competence of employees affect the success of products they develop).
Many researchers have focused on representing domains characterized by these relational
properties. In this section, we formally define the concepts of relational data and mod-
els, providing the basis for the theoretical framework for relational d-separation. We also
show that the relational representation is strictly more expressive than the propositional
representation used in Bayesian network modeling.
A relational schema is a top-level description of what data exist in a particular domain.
Specifically (adapted from Heckerman et al. , 2007):
Definition 4.1 (Relational schema) A relational schema S = {£,TZ,A) consists of a
set of entity classes £ = {Ei, . . . , Em}; relationship classes TZ = {Ri, . . . , Rn}, where each
Ri = {El, . . . , Ej} with Ej G £; attribute classes A{I) for each item / G £ U TZ; and
cardinality function card{R, E) = {one, many} for each R & TZ and each E & R.
A relational schema can be represented graphically with an entity-relationship (ER)
diagram. We adopt a slightly modified ER diagram using Barker's notation (1990), under
which entities are rectangular boxes, relationships are diamonds with dashed lines connect-
ing its associated entities, attributes are ovals residing on entities and relationships, and
cardinalities are represented with crow's foot notation.
Example 4.1 The relational schema S for the organization domain example depicted in
Figure l(a)| consists of entities £ = {Employee, Product, Business-Unit}; relation-
ships TZ = {Develops, Funds}, where Develops = {Employee, Product}, Funds
= {Business-Unit, Product} and having cardinalities card(DEVELOPS, Employee) =
MANY, car(i(DEVELOPS, Product) = many, card(FuNDS, Business-Unit) = one, and
car(i(FuNDS, Product) = many; and attributes ^(Employee) = {Competence, Salary},
^^(Product) = {Success}, and ^(Business-Unit) = {Budget, Revenue}. □
Propositional representations describe domains with a single entity class; thus, they pro-
duce simple schemas with = 1 (one entity class) and |7^| = (no relationship classes). For
the organization domain example, consider data about only employees {£ = {Employee}).
Variables would include intrinsic attributes, such as competence and salary, but could also
include variables describing other related entities, all from the employee perspective (see
Figure 4(a) for the ER diagram of such a simple schema)]^ That is, we could construct
a single table for employees that includes columns for the proportion of successfully de-
veloped products, the total revenue of all business units they work under, etc. Note that
3. This technique of translating a relational database down to a single, propositional representation is often
referred to as proposittonalization (Kramer et al. 20011
9
Maier, Marazopoulou, and Jensen
^ ^ompetence^ RVk^^ ^
EMPLOYEE
(a)
Roger
(b)
Figure 4: Relational schema (a) and skeleton (b) for a propositional representation of em-
ployees from the organization domain example. The skeleton consists of indi-
vidual, disconnected entity instances. Each RVj is a relational variable from the
Employee perspective, such as the proportion of products developed successfully
or the total revenue of involved business units.
propositionalizing inherently relational data still requires the IID assumption, which is of-
ten violated since variables of one instance can influence variables of another. For example.
according to the model in Figure 2(a) the competence of collaborating employees influences
the success of products, which affects the revenue of business units, which affects its bud-
get, thereby influencing an employee's salary. As a result, modeling relational data with a
propositional representation may unnecessarily lose valuable information, especially in the
context of causal reasoning and accurate estimation of causal effects.
A relational schema is a template for the underlying relational skeleton (sometimes re-
ferred to as the data graph), a specific instantiation of entities and relationships. Specifically
(adapted from Heckerman et al.\ 2007):
Definition 4.2 (Relational skeleton) A relational skeleton a is an instantiation of entity
sets (t{E) for each E & £ and relationship sets cr{R) for each R & TZ, adhering to its
cardinality constraints. Let r G cr{R) where R = {Ei, . . . , Ej} be denoted as r(ei, . . . , ej)
where G cr(£'j) and Ei £ £.
Example 4.2 The relational skeleton a for the organization example is depicted in Fig-
ure 1(b), In this small skeleton, there are entity sets (t(Employee) = {Paul, Quinn,
10
Independence in Models of Relational Data
Roger, Sally, Thomas}, (t(Product) = {Case, Adapter, Laptop, Tablet, Smartphone} , and
(t(Business-Unit) = {Accessories, Devices}. There are also relationship sets (t(Develops)
= {develops(Paul, Case), develops(Quinn, Case), . . . , develops (Thomas, Smartphone)} and
cr(FuNDS) = {fmids(Accessories, Case), funds(Accessories, Adapter), funds(Devices,
Smartphone)}. The relationship sets adhere to their cardinality constraints (e.g.. Funds is
a ONE-to-MANY relationship — within cr(FUNDS), every product has a single business unit,
and every business unit may have multiple products). □
The relational skeleton of a propositional representation consists of a set of discon-
nected entity instances, all drawn from the same entity class. Consequently, the skeleton
has a simple one-to-one mapping with the representation as a table: Each entity instance
corresponds to a single row, and each variable is a column. Figure |4(b)| shows a skeleton
for the organization domain example had it been propositionalized from the employee per-
spective. Each employee is an entity instance with variables drawn from intrinsic attributes
and possibly variables describing other related entities. No instances of other entity types
or relationships exist in the skeleton of a propositional representation.
In order to specify a model over a relational domain, we must define a space of possi-
ble variables and dependencies. Consider the example dependency [Product, Develops,
Employee]. Competence — )• [Product]. Success from the model in Figure 2(a) The van
ables here include an intrinsic entity attribute (success of a product) and also a variable on
another entity (the competence of employees that develop a product). For relational data,
the variable space includes not only intrinsic entity and relationship attributes, but also the
variables on other entities and relationships that are reachable by paths along the relational
skeleton. As above, paths in a relational skeleton are instantiations of path templates on a
relational schema.
Definition 4.3 (Relational path) A relational path [/i, . . . ,Ik] for relational schema S
is an alternating sequence of entity and relationship classes Ii, . . . ,Ik € £ L)TZ such that
for ah j > 1: (1) if Ij G £,
then
G TZ and Ij participates in {Ij G
(2) if
Ij G TZ, then Ij^i G £ and Ij-i participates in Ij {Ij-i G Ij), and (3) for each ordered
triple Ij, in [/i, . . . ,Ik], if Ij G TZ, then ^ /j+i and if Ij
Ij-i 7^ Ij+i or 3/e G Ij~i such that Ij ^ le and card{Ij-i, le) = many.
base item, or perspective, of the relational path.
G £, then either
Ii is called the
Definition 4.3 generalizes the notion of "slot chains" from the PRM framework (Getoor
et al. , 2007) by including cardinality constraints and formally describing the semantics
under which repeated item classes may appear on a path. Conditions (2) and (3) in the
definition essentially remove any paths that would invariably reach an empty terminal set
(see Definition 4.4 below) Also, since relational paths may become arbitrarily long, the
path length is ordinarily limited by a user-specified, domain-specific hop threshold.
4. The condition Ij £ TZ ^ 7^ Ij+i suggests at first glance that self-relationships (e.g., employees
manage other employees, individuals in social networks maintain friendships, scholarly articles cite other
articles) are prohibited. However, a common procedure in ER modeling is to map entity names to
role names within the self-relationship (e.g., manager/subordinate, friend l/friend2, citing-paper/cited-
paper), which accords with our definition of relational paths. For simplicity, we omit this additional
layer of complexity.
11
Maier, Marazopoulou, and Jensen
Example 4.3 From the example relational schema in Figure 1(a), we can derive the set
of all relational paths up to some hop threshold. Let the hop threshold be /i = 4. Then,
the set of relational paths from the Employee perspective would include the following:
[Employee] (employees, hops), [Employee, Develops, Product] (products developed
by employees, 2 hops), [Employee, Develops, Product, Funds, Business-Unit] (busi-
ness units of the products developed by employees, 4 hops), and [Employee, Develops,
Product, Develops, Employee] (other employees developing the same products, 4 hops).
Invahd relational paths would include [Employee, Develops, Employee] (because Em-
ployee=Employee and Developsg TZ) and [Business-Unit, Funds, Product, Funds,
Business-Unit] (because Products £ and card(FuNDS, Business-Unit) = one). □
An instantiated relational path produces a set of traversals on a relational skeleton.
However, the quantity of interest is not the paths themselves, but the set of reachable items
(i.e., entity or relationship instances):
Definition 4.4 (Terminal set of a relational path) For any skeleton a and any ii G
(t(/i), a terminal set P\i^ for relational path P = [/i, . . . , 7^] can be defined inductively as
[h]W={ii}
[/i, . . . ,/fc_i,4]|ji = [J {ik I {{ik-i G ik if 4 G 7^) V {ik G ik-i if h G <?))
i.-ie[/i,...,4_i]lH A ^ [/i, . . for j = 1 to A: - 1}
A terminal set of a relational path consists of reachable instances of class I^, the terminal
item on the path. Conceptually, a terminal set is produced by traversing the skeleton
beginning at a single base item ii E ^■(A), following instances of the items in the relational
path, and reaching a target set of Ik instances. The definition implies a "bridge burning"
semantics under which no instantiated items are revisitedl£]
Example 4.4 We can generate terminal sets of a relational path by pairing the set of
relational paths produced by the schema in Figure 1(a) with the relational skeleton in
Figure 1(b) Let Quinn be our base item instance. Then [Employee] Iquinn = {Quinn},
[Employee, Develops, Product] Iquinn = {Case, Adapter, Laptop}, [Employee, Devel-
ops, Product, Funds, Business-Unit] Iquinn = {Accessories, Devices}, and [Employee,
Develops, Product, Develops, Employee] Iquinn = {Paul, Roger, Sally}. The bridge
burning semantics enforces that Quinn is not also included in this last terminal set. □
Most relational paths start and end with different item classes. However, there are pairs
of distinct srelational paths that start and end with the same item classes. For these pairs,
it is possible that their terminal sets, when originating at the same base item instance,
will have items in common. The following lemma states that if two relational paths with
the same base and target items diverge in the middle of the path, then for some relational
5. The bridge burning semantics yields terminal sets that are necessarily subsets of terminal sets which
would otherwise be produced without bridge burning. Although this appears to be limiting, it actually
enables a strictly more expressive class of relational models. See Appendix |B] for more details and an
example.
12
Independence in Models of Relational Data
skeleton, their terminal sets will have a non-empty intersection. This property is important
to consider for relational d-separation, and this is the only form for which non-empty
intersection can occur.
Lemma 4.1 For any schema S and any two relational paths Pi = [/i, . . . , /m, . . . , I^] and
P2 = [h, . . . , In, ■ ■ ■ , Ik] with Im ^ In, there exists a skeleton a such that n P2|n /
for some ii G (^(/i).
Proof. See Appendix \K\
Example 4.5 Let Pathi = [Employee, Develops, Product, Develops, Employee,
Develops, Product], the terminal sets for which yields other products developed by col-
laborating employees. Let Path2 = [Employee, Develops, Product, Funds, Business-
Unit, Funds, Product], for which terminal sets consist of other products funded by
the business units funding products developed by a given employee. Intersection among
terminal sets for these paths occurs even in this small skeleton. In fact, the intersection
of the terminal sets for Pathi and Path2 is non-empty for all employees. For example,
Paul: Pathi|paui = {Adapter, Laptop} and Path2|paui = {Adapter}; Quinn: Pathilquinn =
{Tablet} and Path2| Quinn = {Tablet, Smartphone}. □
Given the definition for relational paths, it is simple to define relational variables and
their terminal sets.
Definition 4.5 (Relational variable) A relational variable [Ii, . . . ,Ik\.V for schema S
consists of a relational path [/i, . . . , 1^] and an attribute class V G A{Ik)-
Example 4.6 Relational variables for the relational paths in Example |4. 3| include intrinsic
variables such as [Employee]. Competence and [Employee]. S'a/ary, and also variables on
related entities such as [Employee, Develops, Product]. S'ltccess, [Employee, Devel-
ops, Product, Funds, Business-Unit]. i?e?;enMe, and [Employee, Develops, Prod-
uct, Develops, Employee]. S'a/ary. □
Definition 4.6 (Terminal set of a relational variable) For any skeleton a and ii G
(t(/i), a terminal set P-V\ii for relational variable P.V = [Ii, . . . ,Ik]-V is the set of variable
instances {ik-V \ V £ A{ik) A G P\ii}-
Example 4.7 Terminal sets for base item instance Sally and the relational variables from
Example 4.6 include [Employee]. Competence [saiiy = {Sally. Competence}, [Employee,
Develops, Product]. Success [saiiy = {Laptop. Success, Tablet. Success}, [Employee,
Develops, Product, Funds, Business-Unit]. i?ewentie|saiiy = {Devices. i^eueniie}, and
[Employee, Develops, Product, Develops, Employee]. S'a/arylsaiiy = {Quinn. S'a/ary,
Thomas. S'a/ary}. □
As a notational convenience, if X is a set of relational variables, all from a common
perspective Ii, then we say that X|jj for some item ii G o"(/i) is the union of all terminal
sets, {x \ X G A X G X}. Given the formal definitions for relational variables, we can
now define relational dependencies.
13
Maier, Marazopoulou, and Jensen
Definition 4.7 (Relational dependency) A relational dependency [Ii, . . . ,Ik]-Vi — t-
[Ii].V2 consists of two relational variables with a common base item and corresponds to a
directed probabilistic dependence from [Ii, . . . , Ik]-Vi to [/i].V2.
Depending on the context, [/i, . . . , Ik]-Vi and [/i].V2 can be referred to as treatment and
outcome, cause and effect, or parent and child. Without loss of generality, Definition 4/7_ pro-
vides a canonical specification for dependencies, with the child relational variable restricted
to singleton paths, thus ensuring that the terminal sets of a child relational variable consist
of a single value.
Example 4.8 The dependencies in the relational model displayed in Figure |2(a) can
be specified as: [Product, Develops, Employee]. Competence — )• [Probvct]. Success
(product success is influenced by the competence of the employees developing the prod-
uct), [Employee]. Competence — [Employee]. Salary (an employee's competence affects
her salary), [Business-Unit, Funds, Product]. Success [Business-Unit]. i?efenwe
(the success of the products funded by a business unit influences that unit's revenue),
[Employee, Develops, Product, Funds, Business-Unit]. i?u(i5et—)>[EMPLOYEE].5a/ari/
(employee salary is governed by the budget of the business units for which they develop
products), and [BusiNESS-UNlT].i2euenue — ^ [BusiNESS-UNiT].i?ndg'ei (the revenue of a
business unit influences its budget). □
A relational model is simply a collection of relational dependencies, defined as:
Definition 4.8 (Relational model) The structure of a relational model A4 = {S,T>)
consists of a schema S paired with a set of relational dependencies P defined over S.
A relational model can be represented graphically by superimposing dependencies on
the ER diagram of a relational schema (see Figure 2(a) for an example). This definition
of relational models is consistent with and yields structures expressible as DAPER models
(Heckerman et al. , 2007). These relational models are also equivalent to PRMs, but we gen-
eralize slot chains as relational paths and provide a formal semantics for their instantiation
(i.e., terminal sets of relational paths and relational variables). These models are also more
general than plate models because dependencies can be specified with arbitrary relational
paths as opposed to simple intersections among plates (Buntine, 1994 Gilks et al. , 1994).
Relational models only define coherent joint probability distributions if they produce
acyclic model instantiations (i.e., ground graphs, defined below). A useful construct for
checking model acyclicity is the class dependency graph (Getoor et al. , 2007), defined as:
Definition 4.9 (Class dependency graph) The class dependency graph Cm = (^i E)
for relational model Ai = {S,T)) is a directed graph with nodes for each attribute of each
item class V = {I.V \ I G £UlZ AUG -^[I)} and edges between pairs of attributes supported
by relational dependencies in the model E = {Ik-Vi — >• Ij-y2 \ [Ij-, ■ ■ ■ ■, -^fe]-Ui — )• [/j].U2 G V}.
If the relational dependencies form an acyclic class dependency graph, then every possi-
ble ground graph of that model is acyclic as well (Getoor et al. , 2007pn All future references
Getoor et al.
(20071 additionally define guaranteed acyclic relationships which enable cyclic relational
dependencies that are guaranteed to produce only acyclic instantiations. The class dependency graphs
are accordingly "colored," and acyclicity is checked with respect to stratification across colors. We omit
this type of relationship for simplicity of developing the relational rf-separation theory.
14
Independence in Models of Relational Data
^Competence
(a)
; Paul.RV2 ]2>
Paul.Competence
Sally.Salary ~~2 >^^ -C ^Sally.RVa 2 I>
Piul.RVr: :^ I ^C^^allyRvT ^ |
Sally.Competence (— »< ^Sally.RVk7I >
"Quinn.RV23>
g'~Quinn.Competence~~a^ L-»< CrQuinn.RVk^
dThomas.Salary]3
i L
| [Thomas.RV2^
cCThomas.Competencr>^ '~-»<lT^°"''^s.R\7r>
gr~RoagrComDetence^a^ ^ — ►C lRPger-RvC ^
(b)
Figure 5: Example Bayesian network (a) and ground graph (b) for the propositional rep-
resentation of employees from the organization domain. The model consists of
simple dependencies involving employee variables. The ground graph is a set of
identical copies of the model structure, one for each instance in the data (skele-
ton).
to acyclic relational models refer to relational models whose dependency structure forms
acyclic class dependency graphs. A parameterized relational model contains conditional
probability distributions for every attribute class A{1^ for each / G £^ U 7^ in order to rep-
resent a joint probability distribution. Similar to Bayesian networks, the joint distribution
factorizes according to the conditional distributions given a relational skeleton o as
P{GGMa) = n n n ^(^-^ l ParentsiLX))
lesunxeAii) iea{i)
where GGmct is the ground graph of the model and skeleton, defined below. Note that
without a generative model of relational skeletons, these relational models are not truly
generative as the skeleton must be generated prior to the attributes. However, the same issue
occurs for Bayesian networks — relational skeletons consist of disconnected entity instances,
but the number of such instances to create is not described by the model. We choose simple
processes for generating skeletons, allowing us to focus on relational models of attributes
and leaving structural causes and effects as future work.
15
Maier, Marazopoulou, and Jensen
A common propositional model is the Bayesian network, as described in Section |3j
Because all variables are defined for a single entity type and no relationships exist, the
relational path specification becomes trivial and, hence, implicit. All relational paths,
relational variables, and relational dependencies are defined from a single perspective with
singleton paths (e.g., [Employee]). Figure 5(a) depicts an example Bayesian network for
the schema in Figure 4(a) The entity box surrounding the model is implicit given a single
entity type.
Just as the relational schema is a template for skeletons, a relational model can be
viewed as a template for ground graphs — how dependencies apply to skeletons.
Definition 4.10 (Ground graph) The ground graph GGmo = (^j E) for relational model
M = {S, T>) and skeleton cj is a directed graph with nodes V = A{cr) = {i.X \ I G
Sun A X e A{I) A i G (t(/)} and edges E = {ik-Y ij.X \ ik.Y,ij.X eV A ik-Y
€[Ij,...,Ik].Y\i^ A [Ij,...,h].Y ^[Ij].X eV}.
A ground graph is a large directed graph, with a node for every variable of every entity
and relationship instance in a skeleton, and an edge between pairs of variable instances
that belong to the terminal sets of the relational variables of all dependencies in a model.
In fact, given an acyclic relational model, the ground graph has the same semantics as a
Bayesian network (Getoor, 2001; Heckerman et al. 2007). The ground graph in Figure 2(b)
is the result of applying the dependencies in the relational model shown in Figure 2(a)
the skeleton in Figure 1(b)
to
The ground graph of a Bayesian network, like the skeleton of a propositional schema,
has a very regular structure. The ground graph consists of a set of independent identical
instances of the model structure, one for each instance in the data. Figure |5(b)| shows the
resulting ground graph of applying the model to the data, or skeleton, of Figure |4(b)[
By Lemma 4.1 and Definition 4.10 it is clear that the same canonical dependency
connects many other relational variables. If the terminal sets involve ik-Y and ij.X, then
there is an implied dependency between all other relational variables for which i^.Y and
ij.X are elements.
Example 4.9 The canonical dependency [Employee]. Compefence—7>[EMPLOYEE]. Salary
yields the edge Koger. Competence — )• Kogei. Salary in the ground graph of Figure 2(b)] be-
cause YLogQT .Competence (z [Employee]. Co77ipete7ic6| Roger-
However, Kogei. Competence
£ [Employee, Develops, Product, Develops, Employee]. Competence [saiiy (as is
Kogev. Salary, replacing Competence with Salary). Consequently, the canonical depen-
dency implies dependence among the variable instances in the terminal sets for [Employee,
Develops, Product, Develops, Employee]. Competence and [Employee, Develops,
Product, Develops, Employee]. 5a/ary. □
These implied dependencies form the crux of the challenge of identifying independence in
relational models. Additionally, the intersection between the terminal sets of two relational
paths is crucial for reasoning about independence because variable instances can belong
to the terminal sets of multiple relational variables. Since d-separation only guarantees
independence when there are no (i-connecting paths, we must consider all possible paths
among variable instances, any one of which may be a member of multiple relational variables.
16
Independence in Models of Relational Data
In Section [5} we define relational c^-separation and provide an appropriate representation,
the abstract ground graph, that enables straightforward reasoning about (/-separation.
5. Relational d-separation
Conditional independence facts are entailed by the rules of d-separation, but only for simple
directed acyclic graphs. As described in Section |4j every ground graph of a Bayesian
network consists of a set of identical copies of the model structure. Thus, the implications
of d-separation on Bayesian networks hold for every ground graph. In contrast, relational
models are templates for ground graphs that vary with underlying skeletons. That is, the set
of distributions represented by a relational model is not only parameterized by conditional
probabilities, but also by the space of valid skeletons given by the schema. Since conditional
independence facts are only useful when they hold across all possible model instantiations,
reasoning about (i-separation for relational models is inherently more challenging and leads
to the following definition:
Definition 5.1 (Relational rf-separation) Let X, Y, and Z be three distinct sets of
relational variables for perspective B £ £ U TZ defined over relational schema S. Then, for
relational model M, X and Y are d-separated by Z if and only if, for any skeleton a, X|fe
and Y|fe are (/-separated by Z\b in ground graph GGmu for all h G cr{B).
In other words, for X and Y to be (/-separated by Z for relational model A^, (i-separation
must hold for all instantiations of those relational variables for any possible skeleton. This is
a conservative definition, but it is consistent with the semantics of (i-separation on Bayesian
networks — it guarantees independence, but it does not guarantee dependence. If there exists
even one valid skeleton and faithful distribution represented by the relational model for
which X 4^ Y I Z, then X and Y are not (/-separated by Z.
Given the semantics specified in Definition |5.1[ answering relational (/-separation queries
is challenging for the following reasons:
All- ground- graphs semantics: Although it is possible to verify (i-separation on a single
ground graph of a relational model, the conclusion may not generalize and ground graphs
can be arbitrarily large. The semantics of (i-separation on Bayesian networks also holds for
all possible ground graphs. However, the ground graphs of a Bayesian network consist of
identical copies of the structure of the model, and it is sufficient to reason about (i-separation
on a single subgraph.
Relational models are templates: Relational models may appear to be directed acyclic
graphs, but they are templates for constructing ground graphs. The rules of (i-separation
do not directly apply to relational models, only to its ground graphs. Applying the rules
of (i-separation to a relational model frequently leads to incorrect conclusions because of
unrepresented confounding paths that are only manifest in ground graphs.
Terminal sets of relational variables may intersect: The terminal sets of two different
relational variables may have non-empty intersections, as described by Lemma 4.1 Con-
sequently, there exist non-intuitive implications of dependencies that (/-separation must
account for, such as the relational (/-connecting paths in the example in Section [2j
Relational dependency specification: Relational models are defined with canonical de-
pendencies, each specified from a single perspective. However, variables in a ground graph
17
Maier, Marazopoulou, and Jensen
may contribute to the terminal sets of multiple relational variables, each defined from dif-
ferent perspectives. Thus, we need methods to translate and extend canonically specified
dependencies to produce the implied dependencies between arbitrary relational variables,
such as the implied dependency described in Example |4.9[
5.1 Abstracting over All Ground Graphs
The definition of relational d-separation and its challenges suggest a solution that abstracts
over all possible ground graphs and explicitly represents the potential intersection between
the terminal sets of pairs of relational variables. We introduce a new lifted representation,
called the abstract ground graph, that captures all dependencies among arbitrary relational
variables for any ground graph, using knowledge of only the schema and the model.
Definition 5.2 (Abstract ground graph) An abstract ground graph AGGMBh = (^^E)
for relational model Ai = perspective B £ S U TZ, and hop threshold /i G N*^ is an
abstraction of the dependencies P for all possible ground graphs GGm^ of on arbitrary
skeletons cr.
The set of nodes in AGGMBh-, V = RVUIV , is the union of all relational variables RV =
[\^B, . . . , Ij\.V I length{[B , . . . , Ij]) < /i+l} and the intersections between pairs of relational
variables that could intersect IV = {XnY \ X,Y £ RV A X = [B, . . . , 1^, . . . , Ij].V A Y =
[B,...,Ii,...,Ij].V A h^h).
The set of edges in AGGmbh is E = RVE U IVE, where RVE C RV x RV and
IV E C IV X RV U RV X IV. RVE is the set of edges between pairs of relational variables:
RVE = {[B,..., Ik].Vi ^ [B,..., Ij].V2 \{Ij,..., Ik]-Vi ^ [Ij]-V2 A [i?, . . . , 4] e
extend{[B, . . .,/,], [Ij, . . . ,4])}.
IVE is the set of edges inherited from both relational variable sources of every inter-
section variable: IVE = {X ^ [B, . . . ,Ij].V2 \ X = Pi.Vi D P2.V1 e IV A {Pi.Vl
[B,..., Ij].V2 e RVE V P2.V1 ^ . . . , Ij].V2 e RVE)} U {[B,...,Ij].Vi ^ X \ X =
P1.V2nP2.V2 €IV A {[B, . . .,Ij].Vi Pi.Vl G RVE V [B,.. .,Ij].Vi P2.V1 £ RVE)}.
The extend method is described in Definition 5.3 below. Essentially, the construction of
an abstract ground graph for relational model Ai, perspective B £ SUTZ, and hop threshold
h follows three simple steps: (1) Add a node for all relational variables, with relational path
length limited by h. (2) Insert edges for the direct causes of every relational variable. (3)
For each pair of potentially intersecting relational variables, add a new "intersection" node
that inherits the direct causes and effects from both of its sources. Then, to answer queries
of the form "Are X and Y d-separated by Z?", simply (1) augment X, Y, and Z with their
corresponding intersection variables and (2) apply the rules of d-separation on the abstract
ground graph for the common perspective of X, Y, and Z.
Example 5.1 Figure [6] shows the abstract ground graph AGGm Employee 6 for the organi-
zation example from the Employee perspective with hop threshold /i = 6j^ As in Section [2|
we derive an appropriate conditioning set Z in order to d-separate individual employee com-
petence (X = {[Employee]. Competence}) from the revenue of the employee's funding busi-
ness units (Y = {[Employee, Develops, Product, Funds, Business-Unit]. fieuenue}).
7. The variables Salary and Budget are removed for simplicity. They are irrelevant for this rf-separation
example as they are solely effects of other variables.
18
Independence in Models of Relational Data
[EMPLOYEE, DEVELOPS, PRODUCT, FUNDS, BUSINESS-UNIT, FUNDS, PRODUCT] .Success
(TemPLOYEE] .Competence^ ►(^[IemPLOYEE, DEVELOPS, PRODUCT].Succesr]) ►<^||^]jiMPLOYEE, DEVELOPS, PRODUCT, FUNDS, BUSINESS-UNItl-ffei^eni/e^
7
^EMPLOYEE, DEVELOPS, PRODUCT, DEVELOPS, EMPLOYEE]. Compe(eiTce
z
EMPLOYEE, DEVELOPS, PRODUCT, DEVELOPS, EMPLOYEE, DEVELOPS, PRO DUCT]. Success
n
EMPLOYEE, DEVELOPS, PRODUCT, FUNDS, BUSINESS-UNIT, FUNDS, PRODUCTJ.Success
[EMPLOYEE, DEVELOPS, PRODUCT, DEVELOPS, EMPLOYEE, DEVELOPS, PRODUCT].Soccess
Figure 6: The abstract ground graph for the organization domain model in Figure 2(a)
from the Employee perspective with hop threshold h = 6 (with the variables for
Salary and Budget omitted for simplicity). This abstract ground graph includes
one intersection node.
Applying the rules of (f-separation to the abstract ground graph, we see that it is necessary
to condition on both product success ([Employee, Develops, Product]. Success) and
the competence of other employees developing the same products ([Employee, Develops,
Product, Develops, Employee]. Competence). For h = 6, augmenting X, Y, and Z with
their corresponding intersection variables does not result in any changes. For h = 8, the ab-
stract ground graph includes a node for relational variable [Employee, Develops, Prod-
uct, Develops, Employee, Develops, Product, Funds, Business-Unit]. .Revenue
(the revenue of the business units funding the other products of collaborating employees)
which, by Lemma 4.1, could have a non-empty intersection with [Employee, Develops,
Product, Funds, Business-Unit]. i?euenne. Therefore, Y would also include the inter-
section with this other relational variable. However, for this query, the conditioning set Z
for h = 6 happens to also d-separate at /i = 8 (and any h £ N^). □
Using the algorithm devised by Geiger et al. (1990), relational c?-separation queries
can be answered in 0(|ii^|) time with respect to the number of edges in the abstract
ground graph. In practice, the size of an abstract ground graph depends on the relational
schema and model (i.e., the number of entities, relationships, cardinalities, attributes, and
dependencies — see the experiment in Section 7.1), as well as the hop threshold limiting the
length of relational paths. For the example in Figure [6| the abstract ground graph has 7
nodes and 7 edges (including 1 intersection node with 2 edges); for /i = 8, it would have 13
nodes and 21 edges (including 4 intersection nodes with 13 edges). Abstract ground graphs
are invariant to the size of ground graphs, even though ground graphs can be arbitrarily
large — that is, relational databases have no maximum size.
Next, we formally define the extend method, used internally for the construction of
abstract ground graphs. This method translates dependencies specified in the model into
dependencies in the abstract ground graph.
Definition 5.3 (Extending relational paths) Let Porig = [hT--,Ij] and let Pext =
[Ij, . . . , Ik] be two relational paths for schema S. The following three functions extend Porig
19
Maier, Marazopoulou, and Jensen
with Pext-
extend{Porig, Pext) = {P = concat{Porig[0 : length{Porig) - i + l],Pext[i ■ length{Pext)]) \
i £ pivots{reverse{Porig), Pext) A isValid{P)^
pivots{Pi,P2) = {i \ Pi[0 : i] = P2[0 : i]}
. f True if P does not violate Definition l43l
isV aliaiP) = < ^ , , ,
[ liaise otherwise
where concat, length, reverse, and inclusive-exclusive sublist are standard list functions.
Informally, the extend method constructs a set of valid relational paths from two input
relational paths. It first finds the indices of all items for which the input paths have a
common starting subpath (called pivots), and then it concatenates the two paths at each
pivot, removing one of the duplicated subpaths (see Example 5.2). Since (Z-separation re-
quires blocking all paths of dependence between two sets of variables, the extend method
is critical to ensure the soundness and completeness of our approach. The abstract ground
graph must capture all paths of dependence among the variable instances in the terminal
sets of relational variables for any represented ground graph. However, relational models
are specified solely with respect to canonical dependencies; the extend method is called re-
peatedly during the creation of an abstract ground graph, with Porg set to a given relational
path and Pext drawn from a canonical dependency.
Example 5.2 During the construction of the abstract ground graph j4GG_a/( employee 6
de-
picted in Figure [6j the extend method is called several times. First, all relational variables
for h = 6 from the Employee perspective are added as nodes in the graph. Next, extend is
used to insert edges corresponding to direct causes. Consider the node for [Employee, De-
velops, Product]. Success. The construction of j4GG^ employee 6 calls extend{Porig , Pext)
with Porig = [Employee, Develops, Product] and Pext = [Product, Develops, Em-
ployee] because [Product, Develops, EMPLOYEE].Competence—^ [Product]. Success G
V. Here, extend{Porig , Pext) = {[Employee], [Employee, Develops, Product, Devel-
ops, Employee]}, which leads to the insertion of two edges in the abstract ground graph.
Note that pivots{r ever se{Porig), Pext) = {1,2,3}, and the pivot at i = 2 yields the invalid
relational path [Employee, Develops, Employee]. □
We also describe two important properties of the extend method with the following two
lemmas. The first lemma states that every relational path produced by the extend method
yields a terminal set for some skeleton such that there is an item instance also reachable by
the two original paths. This lemma is useful for proving the soundness of our abstraction —
that all edges inserted in an abstract ground graph correspond to edges in some ground
graph.
Lemma 5.1 For any two relational paths Porig = [Ii, ■ ■ ■ , Ij] o-nd P^xt = [Ij, ■ ■ ■ ,Ik] with
P = extend{Porig, Pext) , VP e P there exists a relational skeleton a such that 3ii G (t(/i)
such that 3ik G P\i-^ and 3ij G Poriglh such that i^ G Pext\ij-
20
Independence in Models of Relational Data
Proof. See Appendix [K]
Example 5.3 Let a be the skeleton shown in Figure 1(b), let Porig = [Employee,
Develops, Product], let Pext = [Product, Develops, Employee], and let ii
Sally G (t(Employee). From Example 5.2, we know that P = extend{Parig, Pext)
{[Employee], [Employee, Develops, Product, Develops, Employee]}. We also
have [Employee] I Sally = {Sally} and [Employee, Develops, Product, Develops, Em-
ployee] [saiiy = {Quinn, Roger, Thomas}. By Lemma 5.1 , there should exist an ij £ Poriglh
for each of these employees (Quinn, Roger, Sally, and Thomas) such that they are in the ter-
minal set Pextliy We have Poriglsaiiy = {Laptop, Tablet}, and PextlLaptop = {Quinn, Roger,
Sally} and -Pext I Tablet = {Sally, Thomas}. So, the lemma clearly holds for this example. □
Lemma 5.1 guarantees that, for some relational skeleton, items in the terminal sets
produced by extend are also reachable in combination of the two input paths, Porig and
Pext- It is also possible (although infrequent) that there exist items reachable by Porig
and Pext that are not in the terminal set of any path produced with extend{Porig , Pext) ■
The following lemma describes this unreachable set of items, stating that there must exist
an alternative relational path P^^^g that intersects with Porig that, when using extend,
catches those remaining items. This lemma is important for proving the completeness of
our abstraction — that all edges in any ground graph are represented in the abstract ground
graph.
Lemma 5.2 For any relational skeleton a and two relational paths Porig = [A, • ■ • , Ij] o,nd
Pext = [/,-,..., 4] with P = extend{Porig,Pext), Vii G cr(/i) yij £ Porig\i^ yik G Pextlij if
yP £ P ik ^ -P|ii, then ^P'orig such that PorigW ^ P'origW / ^ '^''^d ik G -P'lii for some
P' G extend{P'^-g,Pext)-
Proof. See Appendix \K\
Example 5.4 Although Lemma |5.2| does not apply to the organization domain as cur-
rently represented, it could apply if either (1) there were cycles in the relational schema
or (2) if the path specifications on the relational dependencies included a cycle. Consider
additional relationships between employees and products. If employees could be involved
with products at various stages (e.g., research, development, testing, marketing), then there
would be alternative relational paths for which the lemma might apply. The proof of the
lemma in Appendix [A| provides abstract conditions describing when the lemma applies. □
5.2 Proof of Correctness
The correctness of our approach to relational d-separation relies on several facts: (1) d-
separation is valid for directed acyclic graphs (DAGs); (2) ground graphs are DAGs; and
(3) abstract ground graphs are directed acyclic graphs that represent exactly the edges that
could appear in all possible ground graphs. It follows that d-separation on abstract ground
graphs, augmented by intersection variables, is sound and complete for all ground graphs.
Additionally, we show that since relational d-separation is sound and complete, it is also
equivalent to the Markov condition for relational models. Using the previous definitions
and lemmas, the following sequence of results proves the correctness of our approach to
identifying independence in relational models.
21
Maier, Marazopoulou, and Jensen
Theorem 5.1 The rules of d- separation are sound and complete for directed acyclic graphs.
Proof. Due to Verma and Pearl (1988) for soundness and Geiger and Pearl (1988) for
completeness. ■
Theorem 5.1 implies that (1) all conditional independence facts derived by d-separation
on a Bayesian network hold in any faithful distribution represented by that model (sound-
ness) and (2) all conditional independence facts that hold in a faithful distribution can
be inferred from d-separation applied to the Bayesian network encoding that distribution
(completeness).
Lemma 5.3 For any acyclic relational model Ai and skeleton a, the ground graph GGmu
is a directed acyclic graph.
Proof. Due to both [Heckerman et al.| ( |2007[ ) for DAPER models and |Getoor| ( |200l[ ) for
PRMs. ■
Lemma |5.3| states that any ground graph of an acyclic relational model is essentially a
very large Bayesian network. By Theorem 5.1, d-separation is sound and complete when
applied directly to a ground graph. However, Definition 5.1 explicitly states that rela-
tional d-separation must hold across all possible ground graphs, which is the reason for
constructing the abstract ground graph representation. Next, we introduce the notion of
(B, /i)-reachability, which describes the conditions under which we can expect an edge in a
ground graph to be represented in an abstract ground graph.
Definition 5.4 {{B, /i)-reachability) Let GGMa be the ground graph for some relational
model Ai and skeleton a. Then, ik.Vi — )• ij.V2 G GGMa is {B, h)-reachahle for perspective
B and hop threshold h if there exist relational variables Pk-Vi = [i?, . . . , and Pj.V2 =
[i?, . . . , /j].V2 such that length[Pk) < /i + 1, length{Pj) < /i + 1, and there exists a b £ a{B)
where ik G Pk\b and ij G -Pj lb-
Example 5.5 Consider the ground graph shown in Figure 2(b) Let perspective B be Em-
ployee, and let the hop threshold h = 6. Let ik-Vi — )• ij.V2 be the edge Laptop. S'uccess — )•
Devices. -Refenue in the ground graph. Then, we can show that this edge is {B, /i)-reachable
because of the following: Set Pk-Vi = [Employee, Develops, Product]. Success, Pj.V2 =
[Employee, Develops, Product, Funds, Business-Unit]. iZeuenne, and let b = Sally
G (t(Employee). We have length{Pk) = 3 < 7, length(Pj) = 5 < 7, LaptopG Pfclsaiiy, and
DevicesG Pjisaiiy □
Since Definition 5.4 pertains to edges reachable via a particular perspective B and hop
threshold h, it relates to the reachability of edges in abstract ground graphs. Specifically,
-jcr, we can derive a set
Definition 5.4 implies that (1) for any edge in ground graph GGmc
of abstract ground graphs for which that edge is {B, /i)-reachable, and (2) for any abstract
ground graph AGGmbk-, we can derive the set of (B, /i)-reachable edges for a given ground
graph. Given (5, /i)-reachability, we can now express the soundness and completeness of
abstract ground graphs, which is depicted graphically in Figure [7j
22
Independence in Models of Relational Data
{B, /i)-soundness
AGGMBh ^ * GGMa
{B, /i)-completeness
Figure 7: A relational model A4 combined with a perspective B and a hop threshold h
yields the abstract ground graph AGGj^Bh- The model combines with a skeleton
a to produce the ground graph GGmct- In Theorem 5.2 we show that abstract
ground graphs are both sound and complete with respect to {B, /i)-reachability.
Theorem 5.2 For any acyclic relational model M, perspective B € £yjTZ, and hop thresh-
old h G N*^, the abstract ground graph AGGj^Bh is [B, h)-reachahly sound and complete for
any ground graph GGjUa for all skeletons a.
Proof. See Appendix [A)
Theorem |5.2| guarantees that, up to the hop threshold for a given perspective, abstract
ground graphs capture all possible paths of dependence between any pair of variables in
any ground graph. The details of the proof provide the reasons why explicitly representing
the intersection between pairs of relational variables is necessary for ensuring a sound and
complete abstraction.
Theorem 5.3 For any acyclic relational model M, perspective B ^ £\JTZ, and hop thresh-
old /i G N'^, the abstract ground graph AGGj^iBh is directed and acyclic.
Proof. Let M be an arbitrary acyclic relational model, let G i5 U 7^ be an arbitrary
perspective, and let /i G N'^ be an arbitrary hop threshold. It is clear that every edge
in the abstract ground graph AGGMBh is directed by construction in Definition 5.2 All
edges inserted in any abstract ground graph are drawn from the directed dependencies in
a relational model. Since Ai is acyclic, they class dependency graph Gm is also acyclic by
Definition 4.9, Assume for contradiction that there exists a cycle of length n in AGGMBh
that contains both relational variables and intersection variables. By Definition 5.2 , all edges
inserted in AGGmbh are drawn from some dependency in A4, even for nodes corresponding
to intersection variables. Retaining only the final item class in each relational path for every
node in the cycle will yield a cycle in G^ by Definition 4.9 So, A4 could not have been
acyclic, which contradicts the assumption. ■
Theorem |5.3| ensures that the standard rules of c?-separation can apply directly to ab-
stract ground graphs because they are acyclic given an acyclic model. We now have suf-
ficient supporting theory to prove that d-separation on abstract ground graphs is sound
and complete. In the following theorem, we define W as the set of nodes augmented
with their corresponding intersection nodes for the set of relational variables W: W =
W U UvKewi^ CiW I W nW is an intersection node in AGGMBh}- We also say that
d-separation holds up to a specified hop threshold h if there are no d-connecting paths
involving a relational path (for some relational variable) of length greater than /i + 1.
23
Maier, Marazopoulou, and Jensen
Theorem 5.4 Relational d-separation is sound and complete for abstract ground graphs up
to a specified hop threshold. Let X, Y, and Z be three distinct sets of relational variables
for perspective B £ £ U TZ defined over relational schema S. Then, for any skeleton a and
for all b E (t{B), X|ft and Yj^ are d-separated by Z|b up to hop threshold h in ground graph
GGjiia if o-nd only ifX. and Y are d-separated by Z on the abstract ground graph AGGmbh-
Proof. We must show that d-separation on an abstract ground graph imphes d-separation
on all ground graphs it represents (soundness) and that d-separation facts that hold across
all ground graphs are also entailed by d-separation on the abstract ground graph (com-
pleteness). Let M be an arbitrary acyclic relational model, and let X, Y, and Z be three
arbitrary distinct sets of relational variables for perspective B G £\JTZ defined over relational
schema S.
Soundness : Assume that X and Y are d-separated by Z on AGG^mbh for some hop
threshold h G N''. Assume for contradiction that there exists an item b G o"(-B) such that
X|f, and Ylft are not d-separated by Z|{, in the ground graph GGm<t foi' some arbitrary
skeleton a. Then, there must exist a d-connecting path p from some x G X|{| to some
y G 'Y\h given all z G 'L\h- By Theorem 5.2, AGGj^Bh is (S, /i)-reachably complete, so all
(-B, /i)-reachable edges in GGMa are captured by edges in AGGmbh- So, path p must be
represented from all nodes in {n | x G to all nodes in {n | y G n\b} in AGGmbh- If
p is d-connecting in GG^a, then it is d-connecting in AGGj^sh: implying that X and Y
are not d-separated by Z. So, X|;, and Y|f, must be d-separated by Z\b.
Completeness: Assume that X|fe and Y|f, are d-separated by Z|fe in the ground graph
GGMa for all skeletons a for all b G cr{B). Assume for contradiction that X and Y are
not d-separated by Z on AGG^iBh for some hop threshold h G N". Then, there must
exist a d-connecting path p for some relational variable AT G X to some y G Y given all
Z G Z. By Theorem 5.2, AGGjuBh is (S, /i)-reachably sound, so every edge in AGG_MBh
must correspond to some pair of variable instances in some ground graph. So, if p is d-
connecting in AGGj^sh^ then there must exist some skeleton a such that p is d-connecting
in GGmct for some b G cr{B), implying that d-separation does not hold for that ground
graph. So, X and Y must be d-separated by Z on AGGmbh- *
Theorem 5.4 proves that d-separation on abstract ground graphs is a sound and complete
solution to identifying independence in relational models. Given Theorem 5.1, we can also
say that the set of conditional independence facts derived from d-separation on abstract
ground graphs is exactly the same (up to a specified hop threshold) as the set of conditional
independencies in common with all faithful distributions represented by all possible ground
graphs.
Additionally, we can show that relational d-separation is equivalent to the Markov
condition on relational models. In the following definition, let nd{X) be the non-descendant
variables of X and let pa{X) be the set of parent variables of X .
Definition 5.5 (Relational Markov condition) Let X = [B, . . . ,Ik\.V be a relational
variable for perspective B £ £ U TZ defined over relational schema S. Then, for relational
model Ai, P{X \nd{X),pa{X)) = P{X \ pa{X)) if and only if, for any skeleton a, Vx G
A|fe P{x I nd{x),pa{x)) = P{x \ pa{x)) in ground graph GGm<t for all b G cr{B).
24
Independence in Models of Relational Data
In other words, a relational variable X is independent of its non-descendants given its
parents if and only if, for any possible ground graph, the Markov condition holds for all
instances of X. For Bayesian networks, the Markov condition is equivalent to d-separation
(Neapolitan, 2004). Because ground graphs are Bayesian networks (by Lemma 5.3) and rela-
tional (i-separation on abstract ground graphs is sound and complete (by Theorem 5.4), we
can conclude that relational d-separation is equivalent to the relational Markov condition.
6. NaiVe Relational d-separation is Frequently Incorrect
If the rules of ^-separation for Bayesian networks were applied directly to relational models,
how frequently would the conditional independence conclusions be correct? In this section,
we evaluate the necessity of our approach to relational d-separation, namely the use of
abstract ground graphs. We empirically test the consistency of a naive approach to relational
d-separation against our sound and complete solution over a large space of synthetic causal
models. In order to do so, we first formally define how d-separation could actually be applied
directly to the structure of a relational model. Consider the following limited definition
of relational paths, which itself limits the space of models and conditional independence
queries.
Definition 6.1 (Simple relational path) A simple relational path [/i,...,/^] for rela-
tional schema S is an alternating sequence of entity and relationship classes /i, . . . , G £\JTZ
such that Ii ^ ■ ■ ■ ^ Ik and for all j > 1 (1) if Ij S £, then Ij^i G TZ with Ij € Ij-i and (2)
if Ij € IZ, then G £ with S Ij.
The sole difference between relational paths (Definition 4.3) and simple relational paths
is that no item class may appear multiple times along a simple relational path. This yields
paths drawn directly from a schema diagram. For the example in Figure 1(a), [Employee,
Develops, Product] is simple whereas [Employee, Develops, Product, Develops,
Employee] is not.
Additionally, we define simple relational schemas such that for any two item classes
Ij,Ik £ U TZ, there exists at most one simple relational path between them, that is, no
cycles occur in the schema diagram. The example in Figure 1(a) is a simple relational
schema. The restriction to simple relational paths and schemas yields similar definitions
for simple relational variables, simple relational dependencies, and simple relational models.
The example in Figure 2(a) is a simple relational model because it consists of only simple
relational dependencies.
A first approximation to relational d-separation would be to apply the rules of d-
separation for Bayesian networks directly to the graphical representation of relational mod-
els. This is equivalent to constructing the class dependency graph of Definition 4.9 from
the relational model and applying d-separation. The class dependency graph for the model
in Figure 2(a) is shown in Figure 8(a) Note that the class dependency graph removes
perspectives, ignores path designators on dependencies, does not include all implications
of dependencies among arbitrary relational variables, and does not consider intersection
variables.
Although the class dependency graph is independent of perspectives, testing any condi-
tional independence fact requires choosing a perspective. All variables must have a common
25
Maier, Marazopoulou, and Jensen
([ employee .Compefence )— —►( ^ PRODUCT.Success J )— — ►( juSINESS-UNIT.Kerenue )
EMPLOYEE. Salary J)^ (^BUSINESS-UNIT.fiudgeT)
^MPLOYEEl.Compefenre )— ►(^E MPLOYEE, DEVELOPS, PRODUCTl.Succe S^)— ►Cj ^PLOYEE, DEVELOPS, PRODUCT, FUNDS, BUSINESS-UNIT]. Keren ^^
^ I ^ J" I _
[EMPLOYEE], Sa/aryJ) 4 CJEMPLOYEE, DEVELOPS, PRODUCT, FUNDS, BUSINESS-UNIT]. BudgeT^
(; ^ODUCT, DEVELOPS, EMPLOYEE]. Competen ce^)— ► ^[PRODUCT].Sncces 7)——^<|P RODUCT, FUNDS, BUSINESS-UNIT]. gei/enu ?)
^ I ^ ^ T ^
^~lPRODUCT, DEVELOPS, EMPLOYEE].5a/ary~^ < ([jPRODUCT, FUNDS, BUSINESS-UN]T].Budg^^
<JS USINESS-UNIT, FUNDS, PRODUCT, DEVELOPS, EMPLOYEE]. Compefe nce>——^<[[B USINESS-UNIT, FUNDS, PRODUCT], SuccS ?;)——^( lBUSINESS-UNIT].«erenue )
C^[] jBUSINESS-UNIT, FUNDS, PRODUCT, DEVELOPS, EMPLOYEEI.Satary ]^ < ^ [BUSINESS-UNIT].Budgef^
(b)
Figure 8: For the model in Figure 2(a), (a) the class dependency graph and (b) three
simple abstract ground graphs for the Employee, Product, and Business-
Unit perspectives.
base item; otherwise, no method can produce a single consistent, propositional table from
a relational database. For example, consider the construction of a table describing employ-
ees with columns for their salary, the success of products they develop, and the revenue
of the business units they operate under. This procedure requires a join across three re-
lational variables ( [Employee]. S'a/ary, [Employee, Develops, Product]. Success, and
[Employee, Develops, Product, Funds, Business-Unit]. iJeuenue) for every common
base item instance, from Paul to Thomas. Therefore, given a perspective and the space of
simple relational schemas and models, a class dependency graph is equivalent to a simple
abstract ground graph.
Definition 6.2 (Simple abstract ground graph) For simple relational model M =
{S,V), perspective B £ £ UTZ, and hop threshold h e N*^, the simple abstract ground graph
AGGji^^^ is the directed acyclic graph {V, E) that abstracts the dependencies V among
simple relational variables. The set of nodes is the union of all simple relational variables
[[B,...,Ij\.V \ B ^ ■■■ ^ Ij A length{[B, . . . , Ij]) < h + l} , and the set of ed ges connects
simple relational variables {[B,..., Ik\-Vi [B,. .., Ij]-V2 \ [Ij,-- - , h]-Vi [-/j]-V2 eV A
[B,...,Ik] e extend{[B,...,I,],[Ij,...,h]) A [B, . . . , Ik]-Vi,[B, . . . , Ij].V2 G V}.
Simple abstract ground graphs only include nodes for simple relational variables and
necessarily exclude intersection variables. Lemma 4.1 only applies to pairs of simple rela-
tional paths if the schema has cycles, which is not the case for simple relational schemas.
As a result, for all possible perspectives of a given schema and model, all simple abstract
26
Independence in Models of Relational Data
ground graphs consist of the same number of nodes and edges when the hop threshold is
sufficiently large and with the path designator on relational variables redefined from the
given perspective. Figure |8(b)| shows three simple abstract ground graphs from unique
perspectives for the model in Figure |2(a) As noted above, simple abstract ground graphs
are qualitatively the same as the class dependency graph, but this representation enables
answering relational ^-separation queries, which requires a common perspective.
Remark 6.1 The nai've approach to relational d-separation derives conditional indepen-
dence facts from simple abstract ground graphs (Definition 6.2). The sound and complete
approach described in this paper applies (f-separation, augmenting the variable sets with
their intersection variables, to "regular" abstract ground graphs, as described by Defini-
tion 5.2 Clearly, if (i-separation on a simple abstract ground graph claims that X is
d-separated from Y given Z, then d-separation on the regular abstract ground graph yields
the same conclusion if and only if there are no d-connecting paths on the regular abstract
ground graph. Either X and Y can be d-separated by a set of simple relational variables Z,
or they require non-simple relational variables — those involving relational paths with cycles
or intersection variables |£] □
To evaluate the necessity of regular abstract ground graphs (i.e., the additional paths in-
volving non-simple relational variables and intersection variables) , we devised an experiment
to determine the frequency of equivalence of d-separation on simple and regular abstract
ground graphs. The two approaches are only equivalent if a minimal separating set consists
entirely of simple relational variables. Thus, for a given pair of relational variables X and
y, we test the following:
1. Is either X or Y & non-simple relational variable?
2. Are X and Y marginally independent?
3. Does a minimal separating set Z d-separate X and y, where Z consists solely of
simple relational variables?
4. Is there no separating set Z that d-separates X and Yl
If the answer to (1) is yes, then the naive approach cannot apply since either X or y
is not defined on the simple abstract ground graph. If the answer to (2) is yes, then there
is equivalence; this is a trivial case because there are no d-connecting paths for Z = 0. If
the answer to (3) is yes, then there is a minimal separating set Z consisting of only simple
relational variables. In this case, the simple abstract ground graph is sufficient, and we
have equivalence. If the answer to (4) is yes, then no separating set Z, simple or otherwise,
renders X and Y conditionally independent.
We randomly generated a schema and model for 100 trials for each setting using the
following parameters:
8. The theoretical conditions under which equivalence occurs are sufficiently complex that they provide
little utility as they essentially require reconstructing the regular abstract ground graph and checking a
potentially exponential number of dependency paths.
27
Maier, Marazopoulou, and Jensen
• Number of entity classes, ranging from 1 to 4.
• Number of relationship classes fixed to one less than the number of entities, allowing
only simple relational schemas. Relationship cardinalities are randomly selected.
• Number of attributes for each entity and relationship class randomly drawn from a
shifted Poisson distribution with mean A = 1.0 (~ Pois(l.O) + 1).
• Number of dependencies in the model, ranging from 1 to 10.
Then, for all pairs of relational variables with a common perspective for hop threshold
/i = 4, we ran the aforementioned tests against the abstract ground graph with hop threshold
h = 8.
This procedure generated a total of almost 3.6 million pairs of relational variables to test.
Approximately 56% included a non-simple relational variable; the nai've approach cannot
be used to derive a conditional independence statement in these cases, requiring the full
abstract ground graph in order to represent these variables. Of the remaining 44% (roughly
1.6 million), 82% were marginally independent, and 9% were not conditionally independent
given any conditioning set Z. Then, of the remaining 9% (roughly 145,000), we can test
the frequency of equivalence for conditional independence facts with non-empty separating
sets — the percentage of cases for which only simple relational variables are required in a
minimal separating set Z.
Figure [9] shows this frequency for schemas of increasing numbers of entity classes (1-4) for
varying numbers of dependencies in the causal model (1-10). Since relational schemas with
a single entity class and no relationships describe propositional data, the simple abstract
ground graph is equivalent to the full abstract ground graph, which is also equivalent to the
model itself. In this case, the nai've approach is always equivalent because it is exactly d-
separation on Bayesian networks. For truly relational schemas (with more than one entity
class and at least one relationship class), the frequency of equivalence decreases as the
number of dependencies in the model increases. Additionally, the frequency of equivalence
decreases more as the number of entities in the schema increases. For example, the frequency
of equivalence for nine dependencies is 60.3% for two entities, 51.2% for three entities, and
43.2% for four entities.
There are several factors that influence the frequency of equivalence. We learned a linear
regression model with interaction terms and indicator variables for the number of entities,
dependencies, and many cardinalities on relationships. In predicting the raw number of
non-equivalent conditional independencies (those requiring non-simple relational variables
in the separating set), the strongest predictor is the interaction between the number of
many cardinalities and the number of dependencies. This suggests that the more depen-
dencies in the model and the more many cardinalities, the more non-equivalent conditional
independencies will occur. Thus, the frequency of equivalence illustrated in Figure [9] is
only so high because the experiment averages across randomly chosen schemas with ran-
domly selected relationship cardinalities. When there are few many cardinalities, the model
can be more closely approximated by a Bayesian network, for which we expect the naive
approach to be more equivalent. In contrast, in predicting the raw number of equivalent
conditional independencies (those with only simple relational variables in the separating
28
Independence in Models of Relational Data
Figure 9: Of all generated relational (^-separation queries, 56% are not representable with
the naive approach. Of the 44% that are representable (involving only simple
relational variables), 82% are marginally independent and 9% are dependent.
Pairs of relational variables in the remaining 9% are conditionally independent
given a non-empty separating set [X _LL y | Z, where Z 7^ 0). We test whether the
conditioning set consists solely of simple relational variables. If so, then the naive
approach to relational d-separation is equivalent to d-separation on fully specified
abstract ground graphs. This graph plots to the frequency of equivalence across
schcmas with increasing numbers of entity classes (1-4) for varying numbers of
dependencies (1-10). For schemas with more than one entity class, the frequency
of equivalence decreases as the number of dependencies increases. Shown with
95% confidence intervals.
set), the strongest predictors are non-interaction terms: the number of entities (positive),
the number of dependencies (positive), and the number of MANY cardinalities (negative).
Conditional independencies occur more frequently when there are more entities and de-
pendencies; however, non-simple relational variables occur more frequently when there are
more MANY cardinalities.
This experiment suggests that applying d-separation directly to a relational model will
frequently derive incorrect conditional independence facts. Additionally, there is a large
class of conditional independence queries involving non-simple parent or child variables
for which such an approach is undefined. Together, this indicates that fully specifying
29
Maier, Marazopoulou, and Jensen
abstract ground graphs and applying d-separation augmented with intersection variables
(as described in Section [s]) is critical for accurately deriving most conditional independence
facts from relational models.
7. Experiments
To complement the theoretical results, we present a series of three experiments on synthetic
data. The primary goal of these empirical results is to demonstrate the feasibility of applying
relational d-separation in practice. The experiment in Section 7.1 describes the factors
that influence the size of abstract ground graphs and thus the computational complexity
of relational ^-separation. The experiment in Section 7.2 evaluates the growth rate of
separating sets produced by relational d-separation as abstract ground graphs become large.
The results indicate that minimal separating sets grow more slowly than abstract ground
graphs. The experiment in Section 7.3 tests how the expectations of the relational d-
separation theory match statistical conclusions on simulated data. As expected from the
proofs of correctness in Section 5.2 the results indicate a close match, aside from Type I
errors and certain biases of conventional statistical tests on relational data.
7.1 Abstract Ground Graph Size
Abstract ground graphs (AGGs) explicitly represent the intersection among relational vari-
ables and extend the canonically specified dependencies of relational models. Consequently,
it is important to quantify the size of an AGG and determine which factors influence its
size. We randomly generated relational schemas and models for 1,000 trials of each setting
using the following parameters:
• Number of entity classes, ranging from 1 to 4.
• Number of relationship classes, ranging from to 4. The schema is guaranteed to
be fully connected (e.g., schemas with three entity classes require at least two rela-
tionships) and includes at most a single relationship between a pair of entities (e.g.,
schemas with two entity classes must have exactly one relationship class). Relation-
ship cardinalities are selected uniformly at random.
• Number of attributes for each entity and relationship class randomly drawn from a
shifted Poisson distribution with mean A = 1.0 (~ Pois{1.0) + 1).
• Number of dependencies in the model, ranging from 1 to 15.
This procedure generated a total of 450,000 abstract ground graphs across all perspec-
tives, including all entity and relationship classes for each experimental combination. We
measure size as the number of nodes and edges appearing in a given abstract ground graph.
depicts how AGG size varies with respect to the number of many cardi-
Figure 10(a^
nalities in the schema (fixed for models with 10 dependencies), and Figure 10(b) shows how
it varies with respect to the number of dependencies in the model. Note that for a single
entity, abstract ground graphs are equivalent to Bayesian networks. In order to determine
the most infiuential factors of AGG size, we ran log-linear regression using independent
30
Independence in Models of Relational Data
# Nodes (ent. persp.)
# Edges (ent. persp.)
# Nodes (rel. persp.)
# Edges (rel. persp.)
2 4 6 8
Number of many cardinalities
(a)
N CD
o
O O
< g
2 4 6 8 10 12 14
Number of dependencies
(b)
Figure 10: AGG size variation as (a) the number of many cardinalities in the schema in-
creases (dependencies fixed at 10) and (b) the number of dependencies increases.
Shown with 95% confidence intervals.
variables that describe only the schema and model. Detailed results are provided in Ap-
pendix [C} This analysis indicates: (1) As the number of entities, relationships, attributes,
and MANY cardinalities increases, the AGG grows at an exponential rate with respect to
both nodes and edges. (2) As the number of dependencies in the model increases, the
number of edges increases linearly, but the number of nodes is invariant. (3) AGGs with re-
lationship perspectives are larger than entity perspectives because more relational variables
can be defined.
7.2 Minimal Separating Set Size
Because abstract ground graphs can become large, one might expect that separating set^
would also grow to impractical sizes. Fortunately, relational d-separation produces minimal
separating sets that are empirically observed to be small. We ran 1,000 trials of each setting
using the following parameters:
• Number of entity classes, ranging from 1 to 4.
• Number of relationship classes fixed to one less than the number of entities. Relation-
ship cardinalities are selected uniformly at random.
• Total number of attributes across entity and relationship classes fixed to 10.
• Number of dependencies in the model, ranging from 1 to 10.
9. If X and Y are rf-separated given Z, then Z is a separating set for X and Y. A separating set Z is
minimal if there is no proper subset of Z that is also a separating set.
31
Maier, Marazopoulou, and Jensen
_ o
CD
E
E
O
N
CD ^
(/)
i ^
CD
Q.
Dependencies: 5
Entities: 1
10 1 5 10 1 5 10 1 5 10
2 3 4
Figure 11: Minimal separating sets have reasonable sizes, growing only with respect to
model density.
For each relational model, we identified one minimal separating set for up to 100 randomly
chosen pairs of conditionally independent relational variables. This procedure generated
almost 2.5 million pairs of variables.
To identify a minimal separating set between relational variables X and Y, we modified
Algorithm 4 devised by Tian et al. (1998) by starting with all parents of X and Y, the
variables augmented with the intersection variables they subsume in the abstract ground
graph. While the discovered separating sets are minimal, they are not necessarily of min-
imum size. Figure 11 shows the frequency of separating set size as both the number of
entities and dependencies vary. In summation, roughly 83% are marginal independencies
(having empty separating sets), 13% have separating sets of size 1, and less than 0.1%
have separating sets with more than 5 variables. The experimental results indicate that
separating set size is strongly influenced by model density, primarily because the number
of potential d-connecting paths increases as the number of dependencies increases.
7.3 Empirical Validity
As a practical demonstration, we examined how the expectations of the relational d-
separation theory match the results of statistical tests on actual data. We parameterized
relational models in a way that produced linear effects with equal contribution from each
direct cause and standard methods for aggregating terminal sets of relational variables. We
used simple additive linear equations with independent, normally distributed error, and
the average aggregate for terminal sets of relational variables. If Y has no parents, then
Y = e ~ A^fO, 1), and Y = > ; — —avq(X) + .le otherwise. Relational skeletons
X£par(Y)
were constructed such that relationship instances were produced via a latent homophily
process, similar to the method used by Shalizi and Thomas (2011). Each entity instance
received a single latent variable, marginally independent from all other variables. The prob-
ability of any relationship instance was drawn from -j , the inverse logistic function,
1 + e
where d = \Lei — Le2\, the difference between the latent variables on the two entities, and
a = 10, set as the decay parameter. We also scaled the probabilities in order to produce
32
Independence in Models of Relational Data
CO
LLj
a:
LJJ
O
Q
LU
<
LU
CO
Q
O in
£ §
c
to LO
. o
<
"I 1 1 r
1 entity 2 entities 3 entities 4 entities
o
'c
W
d.
o
r
1 entity 2 entities
r
3 entities 4 entities
CO
UJ
a:
LU
O
Q
LU
I-
o
LU
o
o
Q
O
O m
£ §
C
0)
W m
. o
1^
1 1 1 1
1 entity 2 entities 3 entities 4 entities
CO
o
'c
W
Q.
P
1 entity
1 1 1
2 entities 3 entities 4 entities
Figure 12: The relational d-separation theory closely matches the results of statistical tests
on actual data.
an expected degree of 5 for each entity instance when the cardinality of the relationship is
MANY. Since the latent variables arc marginally independent of all others, they are safely
omitted from abstract ground graphs; their sole purpose was to generate relational skele-
tons, providing a greater probability of non-empty intersection variables as opposed to a
random underlying link structure.
For 100 trials, we randomly generated a relational schema and model using the following
parameter settings:
• Number of entity classes, ranging from 1 to 4.
• Number of relationship classes fixed to one less than the number of entities. Relation-
ship cardinalities are selected uniformly at random.
• Number of attributes for each entity and relationship class randomly drawn from a
shifted Poisson distribution with mean A = 1.0 (~ Pozs(l.O) + 1).
33
Maier, Marazopoulou, and Jensen
• Number of dependencies in the model fixed to 10.
We then tested up to 100 true and false relational d-separation queries across 100 skeletons
(i.e., instantiated relational databases) with 1,000 instances of each entity class. To test
for conditional independence between X and Y given Z {X _LL y | Z ?), we tested the
t-statistic for the coefficient of avg{X) in the equation y = /3o + (3iavg{X) + j3iavg{Zi).
The queries were restricted such that X and Y are single relational variables, and the
relational path for y is a singleton. These queries correspond to testing potential canonical
(direct causal) dependencies.
For each query, we recorded the average strength of effect (as measured by the squared
partial correlation coefficient — the proportion of remaining variance of Y explained by X
after conditioning on Z) and the proportion of trials for which each query was significant
(a = 0.01 adjusted with Bonferroni correction with the number of queries per trial). Fig-
ure 12 shows the distribution of the average strength of effect and proportion of significant
trials across both true and false queries for varying numbers of entities. The graph uses a
standard box-and- whisker plot with values greater or less than 1.5 times the inner quartile
range — the difference between the upper and lower quartiles — marked as outliers.
In the vast majority of cases, relational d-separation is consistent with tests on actual
data. For approximately 23,000 true queries, 14.9% are significant in more than one trial,
but only 2.2% have an average effect size greater than 0.01. Aside from Type I error, a
small number of cases exhibit an interaction between aggregation and relational structure
(i.e., degree or the cardinality of terminal sets of relational variables), and even small effects
produce significant results with a large sample size. This interaction violates the identically
distributed assumption of data instances, which produces a biased estimate of effect size
for simple linear regression. Linear regression does not account for these interaction ef-
fects, suggesting the need for more accurate statistical tests of conditional independence for
relational data.
8. Discussion
In this paper, we extend the theory of d-separation to graphical models of relational data.
We present the abstract ground graph, a new representation that is {B, /i)-reachably sound
and complete in its abstraction of dependencies across all possible ground graphs of a given
relational model. We formally define relational ^-separation and offer a sound, complete,
and computationally efficient approach to deriving conditional independence facts from re-
lational models by exploiting their abstract ground graphs. We also show that relational
d-separation is equivalent to the Markov condition for relational models. We provide an
empirical analysis of relational d-separation on synthetic data, demonstrating a close cor-
respondence between the theory and statistical results in practice. Finally, we evaluate
how frequently the complexity of abstract ground graphs proves necessary for accurately
deriving conditional independence facts.
The results of this paper imply potential flaws in the design and analysis of some real-
world studies. If researchers of social or economic systems choose inappropriate data and
model representations, then their analyses may omit important classes of dependencies.
Specifically, our theory implies that choosing a propositional representation from an in-
34
Independence in Models of Relational Data
Figure 13: (a) Two models of a bivariate relational domain with opposite directions of
causality for a single dependency (relationship class omitted for simplicity) ; (b)
a single dependency implies additional dependencies among arbitrary relational
variables, shown here in a fragment of the abstract ground graph for B's per-
spective; (c) an example relational skeleton; and (d) the ground graphs resulting
from applying the relational model to the skeleton.
herently relational domain may lead to serious errors. An abstract ground graph from a
given perspective defines the exact set of variables that must be included in any propo-
sitionalization. The absence of any relational variable (including intersection variables)
may unnecessarily violate causal sufficiency, which could result in the inference of a causal
dependency where conditional independence was not detected. Our work indicates that
researchers should carefully consider how to represent their domains in order to accurately
reason about conditional independence.
The abstract ground graph representation also presents an opportunity to derive new
edge orientation rules for algorithms that learn the structure of relational models, such as
RPC (Maier et al. , 2010). There are unique orientations of edges that are consistent with
a given pattern of association that can only be recognized in an abstract ground graph.
For example, in contrast to bivariate IID data, it is simple to establish the direction of
causality for bivariate relational data. Consider the two bivariate, two-entity relational
models depicted in Figure [l3|| a). The first model implies that values of X on A entities are
caused by the values of Y on related B entities. The second model implies the opposite,
that values of y on i? entities are caused by the values of X on related A entities. For
simplicity, we show the relationship class only as a dashed line between entity classes and
omit it from relational paths.
Figure 13 'b) illustrates a fragment of the abstract ground graph (for hop threshold /i=4)
that each of the two relational models implies. As expected, the directions of the edges in
the two abstract ground graphs are counterposed. Both models produce observable statisti-
cal dependencies for relational variable pairs ([i?].y, [-B, and {\B,A\.X, \B,A,B].Y).
However, the relational variables [-B].y and [B,A,B].Y have different observable statisti-
35
Maier, Marazopoulou, and Jensen
cal dependencies: In the first model, they are marginally independent and conditionally
dependent given [i?,74].X, and in the second model, they are marginally dependent and
conditionally independent given [5,^4]. X. As a result, we can uniquely determine the di-
rection of causality of the single dependence by exploiting relational structure. (There is
symmetric reasoning for relational variables from A's perspective, and this result is also
applicable to ONE-to-MANY data.)
To illustrate this fact more concretely, consider the small relational skeleton shown in
Figure 13 c) and the ground graphs applied to this skeleton in Figure 13 ^d). In the first
ground graph, we have yi _LL y2 and yi -ijL y2 \xi, but in the second ground graph, we have
yi J/ 2/2 and yi -LL 2/2 \xi. These opposing conditional independence relations uniquely
determine the correct causal model.
Deriving and formalizing the implications of relational d-separation is a main direction
of future research. Additionally, our experiments suggest that more accurate tests of con-
ditional independence for relational data need to be developed, specifically tests that can
address the interaction between relational structure and aggregation across terminal sets of
relational variables. This work has also focused solely on relational models of attributes;
future work should consider models of relationship and entity existence to fully characterize
generative models of relational structure. Finally, the theory could also be extended to
incorporate functional or deterministic dependencies, as Z)-separation extends d-separation
for Bayesian networks.
Acknowledgments
The authors wish to thank Cindy Loiselle for her editing expertise. This effort is sup-
ported by the Intelligence Advanced Research Project Agency (lARPA) via Department of
Interior National Business Center Contract number D11PC20152, Air Force Research Lab
under agreement number FA8750-09-2-0187 and Science Applications International Corpo-
ration (SAIC) and DARPA under contract number P010089628. The U.S. Government is
authorized to reproduce and distribute reprints for governmental purposes notwithstanding
any copyright notation hereon. The views and conclusions contained herein are those of
the authors and should not be interpreted as necessarily representing the official policies or
endorsements either expressed or implied, of lARPA, DoI/NBC, AFRL, SAIC, DARPA or
the U.S. Government. Katerina Marazopoulou received scholarship support from the Greek
State Scholarships Foundation.
Appendix A.
In this appendix, we provide detailed proofs for all previous lemmas, theorems, and corol-
laries.
Lemma 14.11 For any schema S and any two relational paths Pi = [Ii, . . . , Im, ■ ■ ■ , Ik] dnd
P2 = [h, . . . , In, ■ ■ ■ , Ik] with Im In, there exists a skeleton a such that Pi\ij^ Ci P2\ii 7^
for some ii G cy{Ii)-
Proof. Proof by construction. Let S be an arbitrary schema with two arbitrary relational
paths Pi = [/i, . . . , /m, . . . , /fc] and P2 = [Ii, 4] where / In- We wiU con-
36
Independence in Models of Relational Data
Figure 14: Schematic of two relational paths Pi and P2 for which Lemma 4.1 guarantees
that some skeleton a yields a nonempty intersection of their terminal sets. The
example depicts a possible constructed skeleton based on the procedure used in
the proof of Lemma |4.1[
struct a skeleton a such that the terminal sets for item ii £ along Pi and P2 have a
nonempty intersection, that is, an item G Pi\i^(^P2\ii / O (roughly depicted in Figure 14).
We use the following procedure to build a:
1. Simultaneously traverse Pi and P2 from Ii until the paths diverge. For each entity
class E £ £ reached, add a unique entity instance e to cr{E).
2. Simultaneously traverse Pi and P2 backwards from until the paths diverge. For
each entity class E £ £ reached, add a unique entity instance e to cr{E).
3. For the divergent subpaths of both Pi and P2, add unique entity instances for each
entity class E £ £.
4. Repeat 1-3 for relationship classes. For each R £ TZ reached, add a unique relationship
instance r connecting the entity instances from classes on Pi and P2, and add unique
entity instances for classes E £ R not appearing on Pi and P2.
This process constructs an admissible skeleton — all instances are unique and this process
assumes no cardinality constraints aside from those required by Definition 4.3 By construc-
tion, there exists an item ii £ cr{Ii) such that Pi|ji n P2\ii = {ik} 7^ 0- ■
Lemma 15.11 For any two relational paths Porig = [Ii, - ■ ■ , Ij\ o,nd Pext = [-(;■>••• > Ik] with
P = extend{Porig, Pext) , VP £ P there exists a relational skeleton a such that 3ii £
such that 3ik £ P|j^ and 3ij £ Porig\ii such that i^ £ Pext\ij-
Proof. Let Porig = Vii ■ ■ ■ i^j] ^-ncl P^xt = [Ijt ■ ■ ^Ik\ be two arbitrary relational paths,
and let P = extend{Porig , Pext) ■ Let c £ pivots{r ever se{Porig), Pext) be an arbitrary pivot
such that P = concat{Porig[0 '■ length{Porig) — c + l],Pe2^f[c : length{Pext)]) ■ Since P E P,
we know that P is a valid relational path, and we can ignore pivots that produce invalid
paths. There are two subcases:
37
Maier, Marazopoulou, and Jensen
Figure 15: Example construction of a relational skeleton for two relational paths Porig =
[Ii, . . . , Im ■ ■ ■ , Ij] and Pext = [Ij, ■ ■ ■ , I'm ■ ■ ■ , Ik], where item class Im is repeated
between Im and Ij. This construction is used within the proof of Lemma 5.1
(a) c = 1 and P = [Ii, . . . , Ij, . . . , I/.]- This subcase holds generally for any skeleton.
Proof by contradiction. Let a be an arbitrary skeleton, choose ii € arbitrarily, and
choose ik £ P\i-^ arbitrarily. Assume for contradiction that there is no ij in the terminal set
Poriglh such that ifc would be in the terminal set Pext\ij, that is, \/ij £ Poriglh ik ^ Pext\ij-
Since P = [/i, . . . , /j, . . . , I^], we know that ik is reached by traversing a from ii via some
ij to ik- But the traversal from ii to ij implies that ij G [/i, . . . = Porig\h-, and the
traversal from ij to ik implies that ik G [Ij, . . . ,Ik]\ij = Pext\ij- So, there must exist an
ij G Porig\i-i SUch that ik G Pext\ij'
(b) c > 1 and P = [Ii, . . . , Im, . . . , I^]. Proof by construction. We build a rela-
tional skeleton a following the same procedure as outlined in the proof of Lemma 4.1
Add instances to a for every item class that appears on Porig and Pext- Since P =
[Ii, . - - , Imt - - - , Ik\, W6 know that ik is reached by traversing a from ii via some im to
ik- By case (a), 3im G [/i, . . . , Im\\ii such that ik G [Im, • • • ) -^fc]|im- We then must show that
there exists an ij G [Im, • • • , Ij\\im with im G [Ij, . . . , /m]|ij - But, constructing the skeleton
with unique item instances for every appearance of an item class on the relational paths
provides this and does not violate and cardinality constraints. If any item class appears
more than once, then the bridge burning semantics are induced. However, adding an ad-
ditional item instance for every reappearance of item classes enables the traversal from ij
to im and vice versa. An example of this construction is displayed in Figure [15} This is
also a valid relational skeleton because Porig and Pext are valid relational paths, and by
definition the cardinality constraints of the schema must permit multiple (many) instances
in the skeleton of any repeated item class. By this procedure, we show that there exists a
skeleton a such that there exists an ij G Porig\ix such that ik G Pext\ij- *
Lemma 15.21 For any relational skeleton a and two relational paths Porig = [Ii, - - - , Ij] and
Pext = [Ij,---,Ik] with P = extend{Porig,Pext), Vil G cr(/i) Mij G Porig\i^ G Pextlij if
yP e F ik ^ -P|ji, then 3Porig such that Poriglii n Poriglii / and ik G P'[i^ for some
P' G extend{P'^^.g,Pext)-
Proof. Proof by construction. Let a be an arbitrary skeleton, and let ii G (y{Ii), ij G
Porig\ii-: and ik G Pext\ij be arbitrary instances such that ik ^ P[i^ for any P G P.
Since ij G Porig\ix and ik G Pext\ij-, but ik ^ P\ii-, there exists no pivot that yields a
common subsequence in Porig and Pext that produces a path in extend that can reach ik-
38
Independence in Models of Relational Data
Figure 16: Schematic of the relational paths expected in Lemma 5.2
Pext
reachable via extend{P^
\Il , . . . , Im 1 ■ ■ ■ 1 1ni ■ ■
orig: -
then there must exist a P'
If item ik is un-
of the form
Let the first divergent item class along the reverse of Porig be /; and along Pext be I„. The
two paths must not only diverge, but they also necessarily reconverge at least once. If Porig
and Pext do not reconverge, then there are no reoccurrences of an item class along any
P E P that would restrict the inclusion of ik in some terminal set P\i^. The sole reason
that ik ^ P\i^ for any P G P is by Definition 4.4, through the bridge burning semantics of
terminal sets.
Without loss of generality, assume Pong and Pext reconverge once, at item class Im-
So, Porig = [h, . . . ,Im, ■ ■ ■ ,Il, ■ ■ ■ ,Ij] and Pext = [Ij , ■ ■ ■ , In, ■ ■ ■ , Im, ■ ■ ■ , 4] with Ii / /„,
as depicted in Figure
path because [Ii , ,
16
Let Porig = [h, ■ ■ ■ , Im, ■ ■ ■ , In, ■ ■ ■ , Ij], which is a valid relational
is a subpath of Porig and [Im, ■ ■ ■ , In, ■ ■ ■ , Ij] is a subpath of Pext-
By construction, ij £ Porigk n Po^gk ■ P' = [h, ■ ■ ■ , Im, ■ ■ ■ , h] & extend{P^^ig, Pext)
with pivot at Im, then ik G P'ln- •
Theorem 15.21 For any acyclic relational model Ai, perspective B G SUTZ, and hop thresh-
old h G N*^, the abstract ground graph AGGMBh is {B, h)-reachably sound and complete for
any ground graph GGjUa for all skeletons a.
Proof. Let M = {S,T>) be an arbitrary acyclic relational model, let P G U 7^ be an
arbitrary perspective, and let /i G N*^ be an arbitrary hop threshold. We first show that
AGGMBh is (P, /i)-reachably sound, and then prove that AGGj^Bh is (P, /i)-reachably
complete. Figure [7] represents the general connection between abstract ground graphs and
ground graphs.
(1) To prove that AGGmbh is (P, /i)-reachably sound, we must show that for every edge
Pfc-^i — ^ Pj-^2 in AGGj^Bh, there exists a corresponding edge ik-Vi ^j-^2 in the ground
graph GGmct for some skeleton a where ik G Pk\b and ij G Pj\b for some h G cr{B). There
are three subcases, one for each type of edge in an abstract ground graph:
(a) Let [P, . . . , Ik]-yi — ^ [B, ■ ■ ■ , Ij]-V2 S RVE be an arbitrary edge in AGGmbh be-
tween a pair of relational variables. Assume for contradiction that there exists no edge
ik-Vi — )• ij-y2 connecting variable instances in any ground graph, that is, V6 G o"(P) yik-Vi G
[P, . . . , Vij.V2 G [P, . . . , Ij].V2\b ife-Vi — )• ij-V2 ^ GGmct for any skeleton a. By Def-
inition 5.2, if the abstract ground graph has edge [P, . . . ,Ik]-Vi — >■ [P, . . . ,Ij]-V2 G RVE,
then the model must have dependency [Ij, . . . ,Ik]-Vi — ?• [-(?■] -1^ G f and the relational
39
Maier, Marazopoulou, and Jensen
path [B, . . . ,Ik] G extend{[B , . . . , Ij],[Ij, . . . , 1^])- So, by Definition 4.10 for ground graphs,
there is an edge to every ij.V2 from every ifc.Vi in the terminal set for ij along [Ij, . . . , Ik]-Vi,
that is, Vij G cr(/j) Vifc G [Ij, ■ ■ ■ ,/fc]|j. ik-Vi — ^ S GGm<t for any skeleton o". Since
[B, . . . , Jfc] G extend{[B , . . . , Ij],[Ij, . . . , I^]), by Lemma 5.1 we know that there exists a
skeleton a such that G [5, . . . , 3ij G . . . , such that ik G [Ij, . . . ,
for some b G cr{B). So, 36 G G [B,...,Ik]\b ^ij G [S, . . . , such that
ik-^i ~^ G GG^fj for some skeleton a, which contradicts the assumption.
(b) Let Pi.VinP2-^i [B,. ■ ■ , Ij]-V2 S IVE be an arbitrary edge in AGGMBh between
an intersection variable and a relational variable, where Pi= [B, . . . , Im, ■ ■ ■ , Ik] and P2 =
[B, . . . , /„, . . . , /fc] with Jm 7^ /„. By Lemma 4.1 there exists a skeleton a such that Pi\h H
-P2I6 7^ for some b G Let G -Pilfc n P2\b for such a 6 G for a. Assume
for contradiction that Vij G [P, . . . , ifc.Vi — )• ij.V2 ^ GGmct- By Definition 5.2, if
the abstract ground graph has edge Pi.Vi n P2-Vi — [P, . . . , /j].V2 G IVE, then either
Pi.Fl ^ [B,..., Ij].V2 G PV'^ or P2.V1 ^[B,..., Ij].V2 G PFE^. Then, as shown in case
(a), 3ij G [B, . . . , Ij]\h such that ifc.Vi — >• ij-V2 G GGj^^j, which contradicts the assumption.
(c) Let [B,... Jk]-Vi Pi-V2 n P2-^2 e be an arbitrary edge in AGGmbh be-
tween a relational variable and an intersection variable, where Pi = [B, . . . ,Im, ■ ■ ■ ,Ij]
and P2 = [B, . . . ,In, . . . ,Ij] with 1^ 7^ /n- The proof follows case (b) to show that
Vzfc G [B, . . . , IkWb^ij G P1.V2 n P2-V2\b such that ifc-Vi — )• ij-V2 G GGmct for some skeleton
cr and 6 G for which G Pi];, H P2|fe.
(2) To prove that the abstract ground graph AGGj^Bh is (P, /i)-reachably complete,
we show that for every (P, /i)-reachable edge ik-Vi — )• in any ground graph GGmct
for some skeleton a, there is a corresponding edge X ^ Y in AGGj^Bh- Specifically, the
(P, /i)-reachable edge ik-Vi — ^j-^ yields two sets of relational variables for some b G cr{B),
namely Pk-Vi = {Pk-Vi \ ik S Pk\b A length{Pk) < h + 1} and Pj.V2 = {Pj.T^2 | ij e
Pj If, A length{Pj) < h+1} by Definition 5.4 Note that all relational variables in both Pk.Vi
and Pj.V2 are nodes in AGGMBh, as are all pairwise intersection variables: VPfc.Vi, P^.Vi G
Pk.Vi Pk.Vi n Pj^.Vi G AGGMBh and VP^.Vs, Pj.F2 G Pj.Va Pj.F2 n P!j.V2 G ^GGa^b^.
Let o" be an arbitrary skeleton, and let ik-Vi — )• ij-V2 G GGm<t be an arbitrary {B,h)-
reachable edge drawn from [Ij, . . . ,Ik].Vi — )• [/j].V2 G V where such that Pk-Vi 7^ and
Pj.Va 0. We show that VP^.^i G Pk-Vi yPj.V2 G Pj.Va either (a) Pfc.Fi ^ P,-.y2 G
AGGa^b;,, (b) Pk.Vi n P^.l/i ^ P,.y2 G vIGGa^b/., where P^.Vl G Pk-Vi, or (c) Pk-Vi ^
Pj.V2nP^.V2 G ^GG^^Bh, where P;.y2 G Pj-Va- Let P^.^i G Pk.Vi,P,-.y2 G Pj.Va be an
arbitrary pair of relational variables drawn from the sets Pk.Vi and Pj.V2.
(a) If Pk G extend{Pj, [Ij,. - ■ , h]), then Pk-Vi Pj-V2 G AGGmbk by Definition
5.2
(b) If Pk i extend{Pj, [Ij,..., h]), but 3P( G extend{Pj, [Ij,..., Ik]) such that P^.Vi G
Pk.Vi, then P^.Fi ^ Pj.y2 G AGGa^b/., and Pfc.Vi n P^.Fi ^ P,.V2 G ^GG^^B/x by
Definition [521
(c) If VP G extend{Pj, [Ij, . . . ,4])P.Fi ^ Pk-Vi, then, by Lemma
ij G P'Ab and Pfc G extend{P', [I„ . . . , 4]). So, P'.^s G Pj.Va, Pfc.Fi ^ K
5.2
and Pfc.Fi ^ Pj.V2 n P,-.y2 G AGGaib/, by Definition
3Pj such that
■2 G AGGa4B/i,
5.2
40
Independence in Models of Relational Data
A
>■ <
B
(1) [B,A].X ^[B].Y
(2) [B,A,B,A].X ^[B].Y
(a) Relational model
Bridge burning f X2
No bridge burning
0r - - - g
(b) Relational skeleton
(2)
(1)&(2)
t
•Identical to
(c) Ground graphs
Figure 17: Example demonstrating that bridge burning semantics yields a more expressive
class of models than semantics without bridge burning, (a) Relational model over
a schema with two entity classes and two attributes with two possible relational
dependencies (relationship class omitted for simplicity), (b) Simple relational
skeleton with three A and three B instances, (c) Bridge burning semantics
yields three possible ground graphs with combinations of dependencies (1) and
(2) whereas no bridge burning yields two possible ground graphs. The bridge
burning ground graphs subsume the no bridge burning ground graphs.
Appendix B.
In this appendix, we provide an example to show that the bridge burning semantics for
terminal sets of relational paths yields a strictly more expressive class of relational models
than semantics without bridge burning. The bridge burning semantics produces terminal
sets that are necessarily subsets of terminal sets which would otherwise be produced without
bridge burning. Paradoxically, this enables a superset of relational models.
41
Maier, Marazopoulou, and Jensen
Recall the definition of a terminal set for a relational path:
Definition 4.4 (Terminal set of a relational path) For any skeleton a^-ji and any ii G
(t(/i), a terminal set P\i^ for relational path P = [/i, . . . can be defined inductively as
[^i]|.i={n}
[Ii, . . . = [J {ik I {{ik-i G 4 if h ^T^) V (ifc G 4_i if h G <?))
i.-ie[/i,...,4-i]lH A ife ^ [h, . . for j = 1 to k - 1}
The final condition in the inductive definition {i^. ^ [-^i, • • • I ti for j = 1 to A: — 1)
encodes bridge burning. The item is only added to the terminal set if it is not a member
of the terminal set of any previous subpath. For example, let P be the relational path
[Employee, Develops, Product, Develops, Employee]. This relational path produces
terminal sets that include the employees that work on the same products: co-workers.
Instantiating this path with the employee Quinn, P|Quinn, produces the terminal set {Paul,
Roger, Sally}. Since Quinn G [Employee]! Quinn j the bridge burning semantics excludes
Quinn from this set. This makes intuitive sense as well — Quinn should not be considered
her own colleague.
A relational model is simply a collection of relational dependencies. Each relational
dependency is primarily described by the relational path of the parent relational variable
(because, for canonically specified dependencies, the relational path of the child consists of
a single item class) . The relational path specification is used in the construction of ground
graphs, connecting variable instances that appear in the terminal sets of the parent and
child relational variables.
To characterize the expressiveness of relational models, we can inspect the space of
representable ground graphs by choosing an arbitrary relational skeleton and a small set of
relational dependencies. We show with a simple example that the bridge burning semantics
for a model over a two-entity, bivariate schema yields more possible ground graphs than
without bridge burning. (We omit the relationship class for simplicity.) In Figure 17(a)|
we present such a model with two possible relational dependencies labeled (1) and (2).
Figure 17(b) provides a very simple relational skeleton involving three A and three B
instances (relationship instances are represented as dashed lines for simplicity). As shown
in Figure 17(c) , the bridge burning semantics leads to three possible ground graphs, one
for each combination of the dependencies (1), (2), and both (1) and (2) together. Without
bridge burning, only two ground graphs are possible because dependency (2) completely
subsumes dependency (1) with those semantics.
This example generalizes to arbitrary dependencies. The terminal sets of relational
paths that repeat item classes subsume subpaths under no bridge burning semantics. This
leads to fewer possible relational models, which justifies our choice of semantics for terminal
sets of relational paths.
Appendix C.
In this appendix, we provide additional details for the experiment in Section 7.1 The
goal of this experiment is to determine which factors influence the size of abstract ground
42
Independence in Models of Relational Data
1 .opih r*! PTit
J. di Uldi
# relationships
1.15
0.451
0.153
^ MANY cardinalities x isEntity=F
1.11
0.359
0.128
# entities
-0.74
0.352
0.101
# many cardinalities x isEntity=T
0.91
0.217
0.069
# many cardinalities x ^ relationships
-0.33
0.108
0.022
# attributes
0.08
0.024
0.005
Table 1: Estimated standardized cofficient, squared partial correlation coefficient, and
squared semipartial correlation coefficient for each predictor in the log-linear model
of the number of nodes in an abstract ground graph.
graphs because the computational complexity of relational (i-separation depends on its
size. Specifically, we show here the results of running log-linear regression to predict the
size of abstract ground graphs for varying schemas and models. We standardized the input
variables by dividing by two standard deviations, as recommended by Gelman (2008). Since
the predictor for the number of dependencies is log-transformed, the standardization for that
variable occurs after taking the logarithm.
In predicting the (log of the) number of nodes, the following variables were significantly
and substantively predictive (in order of decreasing predictive power):
• Number of relationships (positive)
• Interaction between many cardinalities and an indicator variable for whether the
AGG is from an entity or relationship perspective (positive)
• Number of entities (negative)
• Interaction between the number of MANY cardinalities and relationships (negative)
• Total number of attributes (positive)
The fit for the nodes model has an = 0.814 for n = 449, 993, and Table [l] contains
the standardized coefficients as well as the squared partial and semipartial correlation co-
efficients for each predictor. In predicting the (log of the) number of edges, the following
variables were significantly and substantively predictive (in order of decreasing predictive
power):
• Log of the number of dependencies (positive)
• Number of relationships (positive)
• Interaction between many cardinalities and an indicator variable for whether the
AGG is from an entity or relationship perspective (positive)
• Number of entities (negative)
• Interaction between the number of many cardinalities and relationships (negative)
43
Maier, Marazopoulou, and Jensen
1 .opih r*! PTit
J. di UlCli
log(# dependencies)
0.41
0.440
0.165
# relationships
1.09
0.395
0.138
^ MANY cardinalities x isEntity=F
1.21
0.356
0.135
# entities
-0.78
0.353
0.115
# many cardinalities x isEntity=T
1.00
0.231
0.077
# many cardinalities x # relationships
-0.38
0.127
0.031
Table 2: Estimated standardized cofficient, squared partial correlation coefficient, and
squared semipartial correlation coefficient for each predictor in the log-linear model
of the number of edges in an abstract ground graph.
The fit for the edges model has an = 0.789 for n = 449,993, and Table [2] contains
the standardized coefficients and the squared partial and semipartial correlation coefficients
for each predictor.
References
Richard Barker. CASE Method: Entity Relationship Modeling. Addison- Wesley, Boston,
MA, 1990.
Wray L. Buntine. Operations for Learning with Graphical Models. Journal of Artificial
Intelligence Research, 2:159-225, 1994.
Nir Friedman. Inferring cellular networks using probabilistic graphical models. Science, 303
(5659):799-805, 2004.
Dan Geiger and Judea Pearl. On the logic of causal models. In Proceedings of the Fourth
Annual Conference on Uncertainty in Artificial Intelligence, pages 136-147, 1988.
Dan Geiger, Thomas Verma, and Judea Pearl. Identifying independence in Bayesian net-
works. Networks, 20(5):507-534, 1990.
Andrew Gelman. Scaling regression inputs by dividing by two standard deviations. Statistics
in Medicine, 27(15) :2865~2873, 2008.
Andrew Gelman and Jennifer Hill. Data Analysis Using Regression and Multi-
level/Hierarchical Models. Cambridge University Press New York, 2007.
Lise Getoor. Learning Statistical Models from Relational Data. Ph.D. thesis, Stanford
University, 2001.
Lise Getoor and Ben Taskar, editors. Introduction to Statistical Relational Learning. MIT
Press, Cambridge, MA, 2007.
Lise Getoor, Nir Friedman, Daphne Koller, Avi Pfeffer, and Ben Taskar. Probabilistic
relational models. In Lise Getoor and Ben Taskar, editors. Introduction to Statistical
Relational Learning, pages 129-174. MIT Press, Cambridge, MA, 2007.
44
Independence in Models of Relational Data
W. R. Gilks, A. Thomas, and D. J. Spiegelhalter. A language and program for complex
bayesian modeling. The Statistician, 43:169-177, 1994.
David Heckerman, Chris Meek, and Daphne KoUer. Probabilistic entity-relationship mod-
els, PRMs, and plate models. In Lise Getoor and Ben Taskar, editors, Introduction to
Statistical Relational Learning, pages 201-238. MIT Press, Cambridge, MA, 2007.
Michael G. Hudgens and M. Elizabeth Halloran. Toward causal inference with interference.
Journal of the American Statistical Association, 103(482) :832-842, 2008.
Stefan Kramer, Nada Lavrac, and Peter Flach. Propositionalization approaches to relational
data mining. In Saso Dzeroski and Nada Lavrac, editors. Relational Data Mining, pages
262-286. Springer- Verlag, New York, NY, 2001.
Marc Maier, Brian Taylor, Hiiseyin Oktay, and David Jensen. Learning causal models of
relational domains. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial
Intelligence, 2010.
Richard E. Neapolitan. Learning Bayesian Networks. Pearson Prentice Hall Upper Saddle
River, NJ, 2004.
Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press,
New York, NY, 2000.
Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.
Morgan Kaufmann, San Pransico, CA, 1988.
Richard Scheines. An introduction to causal inference. In Vaughan R. McKim and Steven P.
Turner, editors. Causality in Crisis? Statistical Methods and the Search for Causal
Knowledge in the Social Sciences, pages 185-199. University of Notre Dame Press, 1997.
Eran Segal, Ben Taskar, Audrey Gasch, Nir Friedman, and Daphne Roller. Rich probabilis-
tic models for gene expression. Bioinformatics, 17(suppl 1):S243-S252, 2001.
Cosma R. Shalizi and Andrew C. Thomas. Homophily and contagion are generically con-
founded in observational social network studies. Sociological Methods & Research, 40(2):
211-239, 2011.
Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction and Search.
MIT Press, Cambridge, MA, 2nd edition, 2000.
Ben Taskar, Eran Segal, and Daphne Koller. Probabilistic classification and clustering
in relational data. In Proceedings of the Seventeenth International Joint Conference on
Artificial Intelligence, pages 870-878, 2001.
Eric J. Tchetgen Tchetgen and Tyler J. VanderWeele. On causal inference in the presence
of interference. Statistical Methods in Medical Research, 21(l):55-75, 2012.
Jin Tian, Azaria Paz, and Judea Pearl. Finding Minimal D-separators. Technical Report
R-254, UCLA Computer Science Department, February 1998.
45
Maier, Marazopoulou, and Jensen
loannis Tsamardinos, Laura E. Brown, and Constantin F. Aliferis. The max-min hill-
climbing Bayesian network structure learning algorithm. Machine Learning, 65(l):31-78,
October 2006.
Thomas Vcrma and Judca Pearl. Causal networks: Semantics and expressiveness. In
Proceedings of the Fourth Annual Conference on Uncertainty in Artificial Intelligence,
pages 352-359, 1988.
46